You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.
A viable business model is central to understanding how the business currently operates and how those operations can be changed to benefit the tourism sector.
One of the ways to expand the customer base is to introduce a new offering of packages.
Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.
However, the marketing cost was quite high because customers were contacted at random without looking at the available information.
The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.
However, this time the company wants to harness the available data on existing and potential customers to make the marketing expenditure more efficient.
As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and Marketing Team, and to build a model that predicts which customers are likely to purchase the newly introduced travel package.
Objective: to predict which customers are most likely to purchase the newly introduced travel package.
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder as ohe # this will allow us to code categorical data in binary format
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
# Libraries to tune the model and get different metric scores
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
#Loading dataset
data=pd.read_csv("tourism.csv")
data.head()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
data.shape
(4888, 20)
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | 202443.500000 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.00 | 36.0 | 44.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.00 | 13.0 | 20.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.00 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.00 | 22347.0 | 25571.00 | 98678.0 |
We can see from the .shape attribute that there are 4888 rows in the dataset. The describe output shows that not all of the counts equal 4888, which means there must be null values in the dataset. We also see that many of the variables are coded in binary, so there are a lot of zeros; this is acceptable for the categorical columns. ProdTaken, Passport and OwnCar are in binary format. CityTier, NumberOfPersonVisiting, NumberOfFollowups, PreferredPropertyStar, PitchSatisfactionScore and NumberOfChildrenVisiting are categorical variables. Age, DurationOfPitch and NumberOfTrips appear to be discrete numerical variables, while MonthlyIncome is continuous. CustomerID appears to be a unique identifier. We will have to find out whether there are any 0s in the continuous data and decide whether or not to replace them with different values (e.g., the median or mean).
data.isnull().sum() # this will tell us if there are missing values in any of the columns
CustomerID 0 ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
dupes = data.duplicated() #This will check to see if any of the rows are duplicates
sum(dupes)
0
There do not appear to be any duplicate rows. At this point, we will not drop any rows, but we may need to if there are rows missing important information that cannot easily be replaced. Let's look at the datatypes of each column...
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
There are 20 columns of data and 4888 rows. Six additional columns appeared using .info() that were not displayed using .describe(). This is because they are in object format, meaning they are also categorical variables. Before we make any changes to the dataset, let's create a copy called 'TD' - short for Tourism Data - this will allow us to keep the original dataset in case we need to recall it for some reason and gives us a shorthand name for easy manipulation.
TD = data.copy() # This creates a copy of the dataframe
The first step is to remove the columns that do not provide utility. One such column is CustomerID. It is a unique key, but since it is simply a sequential numbering of the clients from 200000 to 204887, we can remove this column and rely on the index to serve the same purpose. This way, pandas will not perform pointless analysis on this column (see .describe() above, for example).
TD.drop('CustomerID',axis=1,inplace=True) # this drops the ID column since it is identical to the python index and is no longer needed
Next, lets convert all the categorical data columns to 'category' types...
cat_cols = ['ProdTaken', 'Passport', 'OwnCar', 'CityTier', 'NumberOfPersonVisiting',
            'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact', 'Occupation',
            'Gender', 'ProductPitched', 'MaritalStatus', 'PitchSatisfactionScore',
            'Designation', 'NumberOfChildrenVisiting']
TD[cat_cols] = TD[cat_cols].astype('category')
TD.info()
Our next step is to rename some of the columns. Many of the column names are long and can be shortened for easier EDA...
# Personal preference: shortening the column names for easier EDA
TD.rename(columns={'TypeofContact': 'Contact',
                   'CityTier': 'City',
                   'DurationOfPitch': 'Duration',
                   'NumberOfPersonVisiting': 'Visitors',
                   'NumberOfFollowups': 'Followups',
                   'ProductPitched': 'Product',
                   'PreferredPropertyStar': 'Stars',
                   'NumberOfTrips': 'Trips',
                   'PitchSatisfactionScore': 'Score',
                   'OwnCar': 'Car',
                   'NumberOfChildrenVisiting': 'Children',
                   'MonthlyIncome': 'Income'}, inplace=True)
TD.head(1)
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
ProdTaken
TD.ProdTaken.describe()
count 4888.000000 mean 0.188216 std 0.390925 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 1.000000 Name: ProdTaken, dtype: float64
We already knew from above that there are no missing values in this variable. We can see there are only two unique values. 0 is the most frequent, at 3968, so there must be 920 1s, or 'yeses'. The variable is binary and answers the question: "Did the customer purchase a package?" (0: No, 1: Yes)
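The class balance matters for the modeling stage later, so it is worth confirming the proportions directly. A minimal sketch, shown on a stand-in Series built from the counts above (in the notebook this would run on TD['ProdTaken']):

```python
import pandas as pd

# Class balance check for the binary target; the Series below is a stand-in
# built from the 3968/920 split observed above.
prod_taken = pd.Series([0] * 3968 + [1] * 920)

counts = prod_taken.value_counts()                 # absolute counts per class
shares = prod_taken.value_counts(normalize=True)   # class proportions

print(counts[0], counts[1])   # 3968 920
print(round(shares[1], 4))    # 0.1882, i.e. the ~18.8% purchase rate
```

The 18.8% positive rate confirms the imbalance noted in the problem statement, which is worth keeping in mind when choosing evaluation metrics.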
Age
TD.Age.describe()
count 4662.000000 mean 37.622265 std 9.316387 min 18.000000 25% 31.000000 50% 36.000000 75% 44.000000 max 61.000000 Name: Age, dtype: float64
TD['Age'].isnull().sum()
226
There are 226 missing Age values. We know the mean is 37.6 and the range is between 18 and 61. We will hold off on replacing the missing values right now until we get a better sense of all the data. For instance, there could be rows with many missing values but if we start replacing missing values one at a time with column means or medians, we may not discover this. If a row is missing a lot of important information, it may be best to just drop the row.
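The concern about rows missing many fields can be checked directly by counting missing values per row. A hedged sketch on a small stand-in frame (the hypothetical column values below are illustrative; the notebook would run TD.isnull().sum(axis=1).value_counts()):

```python
import numpy as np
import pandas as pd

# Count missing values per ROW: a row missing many fields may be better
# dropped than imputed one column at a time.
df = pd.DataFrame({
    'Age':    [41.0, np.nan, 37.0, np.nan],
    'Income': [20993.0, np.nan, np.nan, 18468.0],
    'Trips':  [1.0, np.nan, 7.0, 2.0],
})

missing_per_row = df.isnull().sum(axis=1)
print(missing_per_row.value_counts().sort_index())  # how many rows miss 0, 1, ... fields
```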
Contact
TD.Contact.describe()
count 4863 unique 2 top Self Enquiry freq 3444 Name: Contact, dtype: object
# This will show us the 2 unique contacts and their frequencies
TD["Contact"].value_counts()
Self Enquiry 3444 Company Invited 1419 Name: Contact, dtype: int64
TD['Contact'].isnull().sum()
25
Contact type only has 2 distinct values. There are 25 missing. We will hold off on cleaning missing values until later for reasons mentioned above.
City
TD.City.describe()
count 4888.000000 mean 1.654255 std 0.916583 min 1.000000 25% 1.000000 50% 1.000000 75% 3.000000 max 3.000000 Name: City, dtype: float64
We can see there are no missing values in the City variable, and there are three unique values representing city tiers. However, there is a problem. The variable is described as follows: "City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3". Python does not understand that Tier 1 is more important than Tier 2; it will look at the numerical representation of the tiers and rank 3 above 2 above 1, the opposite of the order management intends. We shall have to recode the values so that Python ranks them the way they are meant to be ranked. To avoid confusion, we will use the following numbers:
-1 = Tier 3
0 = Tier 2
1 = Tier 1
This will ensure that Tier 1 is ranked higher than Tier 2 which is higher than Tier 3 and hopefully reduce mistakes that could occur if we simply switched 1 with 3 and 3 with 1.
TD['City'] = TD['City'].map({3: -1, 2: 0, 1: 1})
# Recodes the city tiers so the numeric order matches the intended ranking
# This will show us the 3 unique city values and their frequencies
TD["City"].value_counts()
1 3190 -1 1500 0 198 Name: City, dtype: int64
After relabeling the city values, we see that most customers are in Tier 1 cities (3190), followed by Tier 3 (1500) with about half as many, and Tier 2 a distant third (198). Most customers live in highly developed urban centres.
Duration
TD.Duration.describe()
count 4637.000000 mean 15.490835 std 8.519643 min 5.000000 25% 9.000000 50% 13.000000 75% 20.000000 max 127.000000 Name: Duration, dtype: float64
TD['Duration'].isnull().sum()
251
The average pitch lasts about 15.5 minutes but ranges from 5 minutes to 127 minutes. We can see there are 251 missing values, which we will deal with later. There is also a huge gap between the 75% quartile and the maximum, meaning there is at least one outlier, which could be a mistake. We will make note of this and examine it more closely in the Univariate Analysis portion.
Occupation
TD.Occupation.describe()
count 4888 unique 4 top Salaried freq 2368 Name: Occupation, dtype: object
# This will show us the 4 unique occupation values and their frequencies
TD["Occupation"].value_counts()
Salaried 2368 Small Business 2084 Large Business 434 Free Lancer 2 Name: Occupation, dtype: int64
TD['Occupation'].isnull().sum()
0
We can see the breakdown of the 4 unique Occupation values. Most customers are salaried or own small businesses, and a smaller percentage own large businesses. There are only 2 Free Lancers; these rows may need to be dropped since they represent such a small part of the sample. We will decide during Univariate Analysis.
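If we do decide the Free Lancer rows are too rare to model, one way to drop rare categories is sketched below. The drop_rare helper is hypothetical (not part of the notebook), shown on a small stand-in frame:

```python
import pandas as pd

# Hypothetical helper: drop rows whose value in `col` occurs fewer than
# min_count times, e.g. the two Free Lancer rows in Occupation.
def drop_rare(df, col, min_count):
    counts = df[col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[col].isin(keep)].copy()

occ = pd.DataFrame({'Occupation': ['Salaried'] * 5 + ['Small Business'] * 4 + ['Free Lancer']})
trimmed = drop_rare(occ, 'Occupation', min_count=2)
print(trimmed['Occupation'].value_counts())  # Free Lancer is gone
```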
Gender
TD.Gender.describe()
count 4888 unique 3 top Male freq 2916 Name: Gender, dtype: object
There appear to be no missing values in the Gender variable; however, there are 3 unique labels. This could be correct, depending on what the third label is (e.g., 'prefer not to say'), but let's check to be sure...
# This will show us the 3 unique genders and their frequencies
TD["Gender"].value_counts()
Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64
The third label is Fe Male. We can assume this is incorrect and should probably be converted to Female. Let's do this now...
TD['Gender'] = TD['Gender'].replace({'Fe Male': 'Female'})
# Corrects the mislabeled 'Fe Male' entries; 'Male' and 'Female' are left as-is
TD['Gender'].isnull().sum()
0
We have reclassified Fe Male as Female and there are no missing values. There are now 2916 males and 1972 females, so roughly 60% of customers are male.
Visitors
TD.Visitors.describe()
count 4888.000000 mean 2.905074 std 0.724891 min 1.000000 25% 2.000000 50% 3.000000 75% 3.000000 max 5.000000 Name: Visitors, dtype: float64
# This will show us the 5 unique visitors and their frequencies
TD["Visitors"].value_counts()
3 2402 2 1418 4 1026 1 39 5 3 Name: Visitors, dtype: int64
TD['Visitors'].isnull().sum()
0
We can see that there are 5 unique values for Visitors, which represents how many people in total are going on the trip. Most of the data lies between 2 and 4, and there are no missing values. We will examine the shape of the data more closely and make a decision on the small counts at 1 and 5 during Univariate Analysis.
Followups
TD.Followups.describe()
count 4843.000000 mean 3.708445 std 1.002509 min 1.000000 25% 3.000000 50% 4.000000 75% 4.000000 max 6.000000 Name: Followups, dtype: float64
# This will show us the 5 unique followups and their frequencies
TD["Followups"].value_counts()
4.0 2068 3.0 1466 5.0 768 2.0 229 1.0 176 6.0 136 Name: Followups, dtype: int64
TD['Followups'].isnull().sum()
45
There are 45 missing values, which we will deal with below. Most of the data falls under 3 and 4 followups. We will check later in Univariate Analysis to see if the data is bell-shaped.
Product
TD.Product.describe()
count 4888 unique 5 top Basic freq 1842 Name: Product, dtype: object
We can see that there are 5 unique product names and that there are no missing values. Let's see what the product names are...
# This will show us the 5 unique products and their frequencies
TD["Product"].value_counts()
Basic 1842 Deluxe 1732 Standard 742 Super Deluxe 342 King 230 Name: Product, dtype: int64
ASSUMPTION:
This variable looks like it should be ordered. We will assume that the order (from least to most important) is as follows: Basic, Standard, Deluxe, Super Deluxe, King. Pandas will not treat this as an ordered variable unless we encode the ranking ourselves, so we will map the names to the numbers 1 through 5...
TD['Product'] = TD['Product'].map({'Basic': 1, 'Standard': 2, 'Deluxe': 3,
                                   'Super Deluxe': 4, 'King': 5})
# Recodes the product names as ranked labels
TD['Product'].isnull().sum()
0
We see that there are no missing values. We have also recoded the labels as a ranked category rather than plain labels; without this step, Python would have treated them as independent, unordered values.
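The integer codes above do the job. A hedged alternative (a sketch, not what this notebook uses) is pandas' ordered categorical dtype, which keeps the original labels while still recording the assumed ranking:

```python
import pandas as pd

# Ordered categorical dtype: keeps the labels but records the ranking
# Basic < Standard < Deluxe < Super Deluxe < King (our assumption above).
order = ['Basic', 'Standard', 'Deluxe', 'Super Deluxe', 'King']
product = pd.Series(['Deluxe', 'Basic', 'King'])
product = product.astype(pd.CategoricalDtype(categories=order, ordered=True))

print(product.min())                    # Basic
print((product > 'Standard').tolist())  # [True, False, True]
```

This keeps the EDA readable (labels survive in plots) while comparisons and sorting respect the order; the trade-off is that tree models would still need an encoding step later.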
Stars
TD.Stars.describe()
count 4862.000000 mean 3.581037 std 0.798009 min 3.000000 25% 3.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: Stars, dtype: float64
# This will show us the 3 unique stars and their frequencies
TD["Stars"].value_counts()
3.0 2993 5.0 956 4.0 913 Name: Stars, dtype: int64
TD['Stars'].isnull().sum()
26
We can see there are 26 missing values, which we will deal with later. We can also see that there are just three values, ranging from 3 to 5. Leaving them as numerals will ensure python ranks them, with 5 being the most important, which is proper.
Marital Status
TD.MaritalStatus.describe()
count 4888 unique 4 top Married freq 2340 Name: MaritalStatus, dtype: object
# This will show us the 4 unique Marital Status values and their frequencies
TD["MaritalStatus"].value_counts()
Married 2340 Divorced 950 Single 916 Unmarried 682 Name: MaritalStatus, dtype: int64
TD['MaritalStatus'].isnull().sum()
0
We have zero missing values and 4 unique labels. However, there could be a problem if the labels are not, in fact, distinct. For example, are Single and Unmarried synonymous and used interchangeably, or are they distinct terms that add value? Single could mean unattached, while Unmarried could mean having a partner but not being married. It could also be argued that Divorced is not significantly different from Single. Before we decide to combine any values, let's perform Univariate Analysis first. If we notice high correlations, etc., we may then decide to transform some of the data.
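If the later analysis does suggest that 'Unmarried' and 'Single' carry the same signal, merging them is a one-liner. A hypothetical sketch (this merge is not actually applied in the notebook):

```python
import pandas as pd

# Hypothetical merge: treat 'Unmarried' as 'Single'. Only worth doing if the
# two groups behave alike in the bivariate analysis.
status = pd.Series(['Married', 'Single', 'Unmarried', 'Divorced'])
merged = status.replace({'Unmarried': 'Single'})

print(merged.value_counts())
```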
Trips
TD.Trips.describe()
count 4748.000000 mean 3.236521 std 1.849019 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 22.000000 Name: Trips, dtype: float64
TD['Trips'].isnull().sum()
140
We see that there are 140 missing values for Trips. The mean is 3.24 and the range is 1 to 22. Again, there is a big gap between the 75% quartile and the max, suggesting the presence of outliers. We will perform Univariate Analysis before transforming this variable.
Passport
TD.Passport.describe()
count 4888.000000 mean 0.290917 std 0.454232 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 1.000000 Name: Passport, dtype: float64
# This will show us the 2 unique passport and their frequencies
TD["Passport"].value_counts()
0 3466 1 1422 Name: Passport, dtype: int64
TD['Passport'].isnull().sum()
0
There are no missing values. Passport is a binary variable, answering the question: "Does the customer have a passport?" More than twice as many customers say no (3466) as say yes (1422).
Score
TD.Score.describe()
count 4888.000000 mean 3.078151 std 1.365792 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: Score, dtype: float64
# This will show us the 5 unique score and their frequencies
TD["Score"].value_counts()
3 1478 5 970 1 942 4 912 2 586 Name: Score, dtype: int64
TD['Score'].isnull().sum()
0
There are no missing values. Most customers rated the pitch a 3, while scores of 5, 1 and 4 are fairly evenly represented. Fewer customers gave a score of 2.
Car
TD.Car.describe()
count 4888.000000 mean 0.620295 std 0.485363 min 0.000000 25% 0.000000 50% 1.000000 75% 1.000000 max 1.000000 Name: Car, dtype: float64
# This will show us the 2 unique car and their frequencies
TD["Car"].value_counts()
1 3032 0 1856 Name: Car, dtype: int64
TD['Car'].isnull().sum()
0
There are no missing values. This category is binary and asks the question: "Does the customer own a car?" Almost twice as many customers own their own vehicle.
Children
TD.Children.describe()
count 4822.000000 mean 1.187267 std 0.857861 min 0.000000 25% 1.000000 50% 1.000000 75% 2.000000 max 3.000000 Name: Children, dtype: float64
# This will show us the 4 unique children and their frequencies
TD["Children"].value_counts()
1.0 2080 2.0 1335 0.0 1082 3.0 325 Name: Children, dtype: int64
TD['Children'].isnull().sum()
66
There are 66 missing values. Most customers have 1 child. The range is from 0 to 3 children.
Designation
TD.Designation.describe()
count 4888 unique 5 top Executive freq 1842 Name: Designation, dtype: object
# This will show us the 5 unique designations and their frequencies
TD["Designation"].value_counts()
Executive 1842 Manager 1732 Senior Manager 742 AVP 342 VP 230 Name: Designation, dtype: int64
TD['Designation'].isnull().sum()
0
There are no missing values. Most customers are Executives or Managers. Notably, the Designation counts mirror the Product counts exactly (1842, 1732, 742, 342, 230), which suggests the product pitched was tied to the customer's designation.
Income
TD.Income.describe()
count 4655.000000 mean 23619.853491 std 5380.698361 min 1000.000000 25% 20346.000000 50% 22347.000000 75% 25571.000000 max 98678.000000 Name: Income, dtype: float64
TD['Income'].isnull().sum()
233
There are 233 missing values. Monthly income ranges from 1000 to 98,678. The max is a lot higher than the 75% quartile, so there are bound to be outliers. The max income equates to nearly $1.2M per year. This could be correct, especially for customers who own big businesses, but we will try to make sense of the outliers during Univariate and Bivariate Analysis.
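The "max far above the 75% quartile" observation can be made precise with the conventional 1.5 x IQR whisker rule. A quick sketch using the Income quartiles reported by describe() above:

```python
# IQR rule applied to the Income quartiles from describe():
q1, q3 = 20346.0, 25571.0
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(lower, upper)  # 12508.5 33408.5
# Both the min income (1000) and the max (98678) fall well outside these
# whiskers, so the boxplot will flag them as outliers.
```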
def histogram_boxplot(data, xlabel=None, title=None, font_scale=2, figsize=(15, 7), bins=None):
    """Boxplot (with mean marker) stacked above a histogram sharing the same x-axis."""
    mean = np.mean(data)
    sns.set(font_scale=font_scale)  # set the seaborn font scale
    f2, (ax_box2, ax_hist2) = plt.subplots(2, sharex=True,
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize)
    sns.boxplot(x=data, ax=ax_box2, showmeans=True, color="violet")  # boxplot; a marker indicates the mean
    sns.histplot(x=data, ax=ax_hist2, bins=bins if bins else 'auto')  # histplot replaces the deprecated distplot
    ax_hist2.axvline(mean, color='g', linestyle='--')  # mean shown as a dashed vertical line
    if xlabel: ax_hist2.set(xlabel=xlabel)  # x-axis label
    if title: ax_box2.set(title=title)  # title of the graph
    plt.show()  # render the figure
def perc_on_bar(plot, feature):
    '''
    plot: the countplot axes object to annotate
    feature: the categorical feature being plotted
    (the function won't work if a column is passed in the hue parameter)
    '''
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()  # y position of the annotation
        plot.annotate(percentage, (x, y), size=20)  # annotate the percentage
    plt.show()  # shows the plot
Product Taken
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['ProdTaken'], palette='winter')
perc_on_bar(ax,TD['ProdTaken'])
Age
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Age'], palette='winter')
median = TD.Age.median() # median age of all customers (pandas skips missing values, unlike np.median)
mean = TD.Age.mean() # mean age of all customers
print('The mean equals', mean)
print('The median equals', median)
The mean equals 37.62226512226512 The median equals 36.0
Contact
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Contact'], palette='winter')
perc_on_bar(ax,TD['Contact'])
City
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['City'], palette='winter')
perc_on_bar(ax,TD['City'])
Duration
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Duration'], palette='winter')
histogram_boxplot(TD['Duration'])
TD.loc[TD['Duration'] >= 40]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | NaN | Company Invited | -1 | 126.0 | Salaried | Male | 2 | 3.0 | 1 | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
| 3878 | 0 | 53.0 | Company Invited | -1 | 127.0 | Salaried | Male | 3 | 4.0 | 1 | 3.0 | Married | 4.0 | 0 | 1 | 1 | 2.0 | Executive | 22160.0 |
median = TD.Duration.median() # median pitch duration (pandas skips missing values, unlike np.median)
mean = TD.Duration.mean() # mean pitch duration
print('The mean equals', mean)
print('The median equals', median)
The mean equals 15.490834591330602 The median equals 13.0
Occupation
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Occupation'], palette='winter')
perc_on_bar(ax,TD['Occupation'])
Gender
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Gender'], palette='winter')
perc_on_bar(ax,TD['Gender'])
Visitors
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Visitors'], palette='winter')
perc_on_bar(ax,TD['Visitors'])
Follow Ups
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Followups'], palette='winter')
perc_on_bar(ax,TD['Followups'])
Product
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Product'], palette='winter')
perc_on_bar(ax,TD['Product'])
Stars
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Stars'], palette='winter')
perc_on_bar(ax,TD['Stars'])
Marital Status
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['MaritalStatus'], palette='winter')
perc_on_bar(ax,TD['MaritalStatus'])
Trips
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Trips'], palette='winter')
perc_on_bar(ax,TD['Trips'])
histogram_boxplot(TD['Trips'])
TD["Trips"].value_counts()
2.0 1464 3.0 1079 1.0 620 4.0 478 5.0 458 6.0 322 7.0 218 8.0 105 21.0 1 19.0 1 22.0 1 20.0 1 Name: Trips, dtype: int64
TD.loc[TD['Trips'] > 8]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | 1 | 30.0 | Company Invited | 1 | 10.0 | Large Business | Male | 2 | 3.0 | 1 | 3.0 | Single | 19.0 | 1 | 4 | 1 | 1.0 | Executive | 17285.0 |
| 816 | 0 | 39.0 | Company Invited | 1 | 15.0 | Salaried | Male | 3 | 3.0 | 3 | 4.0 | Unmarried | 21.0 | 0 | 2 | 1 | 0.0 | Manager | 21782.0 |
| 2829 | 1 | 31.0 | Company Invited | 1 | 11.0 | Large Business | Male | 3 | 4.0 | 1 | 3.0 | Single | 20.0 | 1 | 4 | 1 | 2.0 | Executive | 20963.0 |
| 3260 | 0 | 40.0 | Company Invited | 1 | 16.0 | Salaried | Male | 4 | 4.0 | 3 | 4.0 | Unmarried | 22.0 | 0 | 2 | 1 | 1.0 | Manager | 25460.0 |
median = TD.Trips.median() # median number of trips (pandas skips missing values, unlike np.median)
mean = TD.Trips.mean() # mean number of trips
print('The mean equals', mean)
print('The median equals', median)
The mean equals 3.236520640269587 The median equals 3.0
Passport
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Passport'], palette='winter')
perc_on_bar(ax,TD['Passport'])
Score
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Score'], palette='winter')
perc_on_bar(ax,TD['Score'])
Car
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Car'], palette='winter')
perc_on_bar(ax,TD['Car'])
Children
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Children'], palette='winter')
perc_on_bar(ax,TD['Children'])
Designation
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Designation'], palette='winter')
perc_on_bar(ax,TD['Designation'])
Income
plt.figure(figsize=(15,5))
ax = sns.countplot(TD['Income'], palette='winter')
histogram_boxplot(TD['Income'])
median = TD.Income.median() # median income of all customers (pandas skips missing values, unlike np.median)
mean = TD.Income.mean() # mean income of all customers
print('The mean equals', mean)
print('The median equals', median)
The mean equals 23619.85349087003 The median equals 22347.0
There seem to be only a few outliers above the 40,000 per month mark. Let's look more closely...
TD.loc[TD['Income'] > 40000]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | 1 | NaN | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 |
| 2482 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | 1 | 5.0 | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 |
There also seem to be two outliers below 10K. Let's look more closely...
TD.loc[TD['Income'] < 10000]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 0 | 38.0 | Self Enquiry | 1 | 9.0 | Large Business | Female | 2 | 3.0 | 3 | 3.0 | Single | 4.0 | 1 | 5 | 0 | 0.0 | Manager | 1000.0 |
| 2586 | 0 | 39.0 | Self Enquiry | 1 | 10.0 | Large Business | Female | 3 | 4.0 | 3 | 3.0 | Single | 5.0 | 1 | 5 | 0 | 1.0 | Manager | 4678.0 |
We can start by checking the correlation between the numerical data variables by using .corr and a heatmap function.
TD.corr() # creates a table of how the numerical values are correlated
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ProdTaken | 1.000000 | -0.147254 | -0.086852 | 0.078257 | 0.009627 | 0.112171 | -0.217461 | 0.099577 | 0.018898 | 0.260844 | 0.051394 | -0.011508 | 0.007421 | -0.130585 |
| Age | -0.147254 | 1.000000 | 0.015625 | -0.012063 | 0.011621 | -0.002577 | 0.425807 | -0.010474 | 0.184905 | 0.033399 | 0.018510 | 0.048654 | 0.007370 | 0.464869 |
| City | -0.086852 | 0.015625 | 1.000000 | -0.022703 | 0.001671 | -0.023652 | -0.180000 | 0.009164 | 0.029709 | -0.001793 | 0.042160 | -0.003817 | -0.000672 | -0.051817 |
| Duration | 0.078257 | -0.012063 | -0.022703 | 1.000000 | 0.065141 | 0.009434 | -0.013265 | -0.006637 | 0.009715 | 0.033034 | -0.002880 | -0.001626 | 0.031408 | -0.006252 |
| Visitors | 0.009627 | 0.011621 | 0.001671 | 0.065141 | 1.000000 | 0.328569 | -0.036206 | 0.033867 | 0.195223 | 0.011177 | -0.019581 | 0.010362 | 0.610621 | 0.195134 |
| Followups | 0.112171 | -0.002577 | -0.023652 | 0.009434 | 0.328569 | 1.000000 | -0.001585 | -0.024176 | 0.139517 | 0.004970 | 0.004054 | 0.012112 | 0.286425 | 0.176503 |
| Product | -0.217461 | 0.425807 | -0.180000 | -0.013265 | -0.036206 | -0.001585 | 1.000000 | -0.024358 | 0.060371 | -0.026083 | 0.020908 | 0.060538 | -0.000037 | 0.652621 |
| Stars | 0.099577 | -0.010474 | 0.009164 | -0.006637 | 0.033867 | -0.024176 | -0.024358 | 1.000000 | 0.012115 | 0.001040 | -0.022701 | 0.015742 | 0.035798 | 0.014289 |
| Trips | 0.018898 | 0.184905 | 0.029709 | 0.009715 | 0.195223 | 0.139517 | 0.060371 | 0.012115 | 1.000000 | 0.012949 | -0.004378 | -0.011825 | 0.168795 | 0.139105 |
| Passport | 0.260844 | 0.033399 | -0.001793 | 0.033034 | 0.011177 | 0.004970 | -0.026083 | 0.001040 | 0.012949 | 1.000000 | 0.002926 | -0.022330 | 0.020264 | 0.002545 |
| Score | 0.051394 | 0.018510 | 0.042160 | -0.002880 | -0.019581 | 0.004054 | 0.020908 | -0.022701 | -0.004378 | 0.002926 | 1.000000 | 0.068850 | 0.000878 | 0.030421 |
| Car | -0.011508 | 0.048654 | -0.003817 | -0.001626 | 0.010362 | 0.012112 | 0.060538 | 0.015742 | -0.011825 | -0.022330 | 0.068850 | 1.000000 | 0.026572 | 0.080262 |
| Children | 0.007421 | 0.007370 | -0.000672 | 0.031408 | 0.610621 | 0.286425 | -0.000037 | 0.035798 | 0.168795 | 0.020264 | 0.000878 | 0.026572 | 1.000000 | 0.201643 |
| Income | -0.130585 | 0.464869 | -0.051817 | -0.006252 | 0.195134 | 0.176503 | 0.652621 | 0.014289 | 0.139105 | 0.002545 | 0.030421 | 0.080262 | 0.201643 | 1.000000 |
plt.figure(figsize=(25,10))
sns.heatmap(TD.corr(), annot=True)
plt.show()
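Rather than eyeballing the full heatmap, the correlations with the target can be ranked directly. A minimal sketch on a made-up frame (the column names echo a few from TD, but the numbers are invented):

```python
import pandas as pd

# Toy frame standing in for TD: Passport tracks the target, Car does not
df = pd.DataFrame({
    'ProdTaken': [0, 1, 0, 1, 1, 0, 0, 1],
    'Passport':  [0, 1, 0, 1, 1, 0, 1, 1],
    'Car':       [1, 0, 1, 1, 0, 1, 0, 1],
})

# Rank every numeric column by the strength of its correlation with the target
ranked = df.corr()['ProdTaken'].drop('ProdTaken').abs().sort_values(ascending=False)
print(ranked)
```

On the real TD frame the same one-liner would surface Passport and Product as the strongest linear associations with ProdTaken, matching what the heatmap shows.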
from numpy.polynomial.polynomial import polyfit
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x, TD['ProdTaken'], margins=True)
    print(tab1)
    print('-'*120)
    tab = pd.crosstab(x, TD['ProdTaken'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
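The core of stacked_plot is pd.crosstab with normalize='index', which turns raw counts into per-category purchase rates. A minimal demonstration on made-up data:

```python
import pandas as pd

# Hypothetical mini-sample: purchase outcome by contact type
contact = pd.Series(['Self Enquiry', 'Self Enquiry', 'Company Invited', 'Company Invited'])
taken   = pd.Series([0, 1, 1, 1])

counts = pd.crosstab(contact, taken, margins=True)       # raw counts with row/column totals
rates  = pd.crosstab(contact, taken, normalize='index')  # each row now sums to 1.0
print(counts)
print(rates)
```

With normalize='index' every row becomes a distribution over ProdTaken, which is exactly what the stacked bars visualize.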
Dependent Variable vs Numerical Variables
ProdTaken is the dependent variable and thus the most relevant to the model. We will look at relationships between this variable and the other numerical variables first, and then at any other numerical variable pairs we find interesting.
# ProdTaken vs Age
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Age'); #should give a visual frequency graph
stacked_plot(TD['Age'])
ProdTaken     0    1   All
Age
18.0          6    8    14
19.0         11   21    32
20.0         13   25    38
21.0         18   23    41
22.0         31   15    46
23.0         33   13    46
24.0         38   18    56
25.0         54   20    74
26.0         69   37   106
27.0        108   30   138
28.0        121   26   147
29.0        121   57   178
30.0        151   48   199
31.0        162   41   203
32.0        157   40   197
33.0        149   40   189
34.0        167   44   211
35.0        203   34   237
36.0        204   27   231
37.0        161   24   185
38.0        161   15   176
39.0        136   14   150
40.0        123   23   146
41.0        131   24   155
42.0        122   20   142
43.0        120   10   130
44.0         88   17   105
45.0         98   18   116
46.0        110   11   121
47.0         76   12    88
48.0         54   11    65
49.0         56    9    65
50.0         74   12    86
51.0         73   17    90
52.0         54   14    68
53.0         58    8    66
54.0         59    2    61
55.0         57    7    64
56.0         46   12    58
57.0         24    5    29
58.0         20   11    31
59.0         32   12    44
60.0         28    1    29
61.0          9    0     9
All        3786  876  4662
------------------------------------------------------------------------------------------------------------------------
# ProdTaken vs City
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'City'); #should give a visual frequency graph
stacked_plot(TD['City'])
ProdTaken      0    1   All
City
-1          1146  354  1500
 0           152   46   198
 1          2670  520  3190
All         3968  920  4888
------------------------------------------------------------------------------------------------------------------------
TD.groupby('City')['ProdTaken'].value_counts(normalize=True)
City ProdTaken
-1 0 0.764000
1 0.236000
0 0 0.767677
1 0.232323
1 0 0.836991
1 0.163009
Name: ProdTaken, dtype: float64
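The groupby/value_counts pattern above repeats for every column. A small helper (hypothetical, not part of the notebook) returns just the purchase rate per category, which is the number we actually compare:

```python
import pandas as pd

def purchase_rate(df, col, target='ProdTaken'):
    """Share of rows with target == 1 within each category of `col`."""
    return df.groupby(col)[target].mean().sort_values(ascending=False)

# Toy data mimicking the City column; the numbers are invented
toy = pd.DataFrame({'City': [-1, -1, 0, 1, 1, 1, 1],
                    'ProdTaken': [1, 0, 0, 0, 0, 0, 1]})
print(purchase_rate(toy, 'City'))
```

Because ProdTaken is 0/1, the group mean is exactly the conversion rate, so this one line replaces the crosstab-plus-normalize pair when only the rates are needed.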
# ProdTaken vs Duration
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Duration'); #should give a visual frequency graph
stacked_plot(TD['Duration'])
ProdTaken     0    1   All
Duration
5.0           6    0     6
6.0         268   39   307
7.0         294   48   342
8.0         273   60   333
9.0         404   79   483
10.0        199   45   244
11.0        168   37   205
12.0        162   33   195
13.0        191   32   223
14.0        212   41   253
15.0        217   52   269
16.0        218   56   274
17.0        148   24   172
18.0         56   19    75
19.0         36   21    57
20.0         46   19    65
21.0         57   16    73
22.0         66   23    89
23.0         58   21    79
24.0         54   16    70
25.0         59   14    73
26.0         60   12    72
27.0         58   14    72
28.0         44   17    61
29.0         53   21    74
30.0         65   30    95
31.0         52   31    83
32.0         59   15    74
33.0         46   11    57
34.0         42    8    50
35.0         55   11    66
36.0         34   10    44
126.0         1    0     1
127.0         1    0     1
All        3762  875  4637
------------------------------------------------------------------------------------------------------------------------
TD.loc[(TD['Duration'] >= 15) & (TD['ProdTaken'] == 1)] # duration is >= 15 and the customer said yes to the product
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 1 | 28.0 | Self Enquiry | 1 | 30.0 | Salaried | Male | 2 | 4.0 | 1 | 3.0 | Single | 6.0 | 1 | 2 | 0 | 0.0 | Executive | 17028.0 |
| 21 | 1 | NaN | Self Enquiry | -1 | 15.0 | Salaried | Male | 2 | 4.0 | 3 | 5.0 | Single | 1.0 | 0 | 2 | 0 | 0.0 | Manager | 18407.0 |
| 26 | 1 | NaN | Company Invited | 1 | 22.0 | Salaried | Female | 3 | 5.0 | 1 | 5.0 | Single | 2.0 | 1 | 4 | 1 | 2.0 | Executive | NaN |
| 42 | 1 | 26.0 | Self Enquiry | 1 | 31.0 | Salaried | Male | 2 | 5.0 | 1 | 3.0 | Single | 2.0 | 0 | 2 | 1 | 1.0 | Executive | 17293.0 |
| 45 | 1 | 41.0 | Self Enquiry | 1 | 18.0 | Large Business | Female | 2 | 3.0 | 5 | 3.0 | Divorced | 2.0 | 0 | 4 | 1 | 0.0 | VP | 34545.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4881 | 1 | 41.0 | Self Enquiry | 0 | 25.0 | Salaried | Male | 3 | 2.0 | 1 | 5.0 | Married | 2.0 | 0 | 1 | 1 | 2.0 | Executive | 21065.0 |
| 4882 | 1 | 37.0 | Self Enquiry | 0 | 20.0 | Salaried | Male | 3 | 5.0 | 1 | 5.0 | Married | 6.0 | 1 | 5 | 1 | 2.0 | Executive | 23317.0 |
| 4884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | 1 | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 1 | 52.0 | Self Enquiry | -1 | 17.0 | Salaried | Female | 4 | 4.0 | 2 | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 1 | 19.0 | Self Enquiry | -1 | 16.0 | Small Business | Male | 3 | 4.0 | 1 | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
461 rows × 19 columns
# ProdTaken vs Visitors
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Visitors'); #should give a visual frequency graph
stacked_plot(TD['Visitors'])
ProdTaken     0    1   All
Visitors
1            39    0    39
2          1151  267  1418
3          1942  460  2402
4           833  193  1026
5             3    0     3
All        3968  920  4888
------------------------------------------------------------------------------------------------------------------------
TD.groupby('Visitors')['ProdTaken'].value_counts(normalize=True)
Visitors ProdTaken
1 0 1.000000
2 0 0.811707
1 0.188293
3 0 0.808493
1 0.191507
4 0 0.811891
1 0.188109
5 0 1.000000
Name: ProdTaken, dtype: float64
# ProdTaken vs Followups
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Followups'); #should give a visual frequency graph
stacked_plot(TD['Followups'])
ProdTaken     0    1   All
Followups
1.0         156   20   176
2.0         205   24   229
3.0        1222  244  1466
4.0        1689  379  2068
5.0         577  191   768
6.0          82   54   136
All        3931  912  4843
------------------------------------------------------------------------------------------------------------------------
TD.groupby('Followups')['ProdTaken'].value_counts(normalize=True)
Followups ProdTaken
1.0 0 0.886364
1 0.113636
2.0 0 0.895197
1 0.104803
3.0 0 0.833561
1 0.166439
4.0 0 0.816731
1 0.183269
5.0 0 0.751302
1 0.248698
6.0 0 0.602941
1 0.397059
Name: ProdTaken, dtype: float64
# ProdTaken vs Stars
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Stars'); #should give a visual frequency graph
stacked_plot(TD['Stars'])
ProdTaken     0    1   All
Stars
3.0        2511  482  2993
4.0         731  182   913
5.0         706  250   956
All        3948  914  4862
------------------------------------------------------------------------------------------------------------------------
TD.groupby('Stars')['ProdTaken'].value_counts(normalize=True)
Stars ProdTaken
3.0 0 0.838958
1 0.161042
4.0 0 0.800657
1 0.199343
5.0 0 0.738494
1 0.261506
Name: ProdTaken, dtype: float64
# ProdTaken vs Trips
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Trips'); #should give a visual frequency graph
stacked_plot(TD['Trips'])
ProdTaken     0    1   All
Trips
1.0         508  112   620
2.0        1165  299  1464
3.0         862  217  1079
4.0         417   61   478
5.0         396   62   458
6.0         258   64   322
7.0         156   62   218
8.0          76   29   105
19.0          0    1     1
20.0          0    1     1
21.0          1    0     1
22.0          1    0     1
All        3840  908  4748
------------------------------------------------------------------------------------------------------------------------
# ProdTaken vs Passport
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Passport'); #should give a visual frequency graph
stacked_plot(TD['Passport'])
ProdTaken     0    1   All
Passport
0          3040  426  3466
1           928  494  1422
All        3968  920  4888
------------------------------------------------------------------------------------------------------------------------
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="ProdTaken", y="Passport", data=TD);
TD.groupby('Passport')['ProdTaken'].value_counts(normalize=True)
Passport ProdTaken
0 0 0.877092
1 0.122908
1 0 0.652602
1 0.347398
Name: ProdTaken, dtype: float64
# ProdTaken vs Score
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Score'); #should give a visual frequency graph
stacked_plot(TD['Score'])
ProdTaken     0    1   All
Score
1           798  144   942
2           498   88   586
3          1162  316  1478
4           750  162   912
5           760  210   970
All        3968  920  4888
------------------------------------------------------------------------------------------------------------------------
TD.groupby('Score')['ProdTaken'].value_counts(normalize=True)
Score ProdTaken
1 0 0.847134
1 0.152866
2 0 0.849829
1 0.150171
3 0 0.786198
1 0.213802
4 0 0.822368
1 0.177632
5 0 0.783505
1 0.216495
Name: ProdTaken, dtype: float64
# ProdTaken vs Car
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Car'); #should give a visual frequency graph
stacked_plot(TD['Car'])
ProdTaken     0    1   All
Car
0          1496  360  1856
1          2472  560  3032
All        3968  920  4888
------------------------------------------------------------------------------------------------------------------------
TD.groupby('Car')['ProdTaken'].value_counts(normalize=True)
Car ProdTaken
0 0 0.806034
1 0.193966
1 0 0.815303
1 0.184697
Name: ProdTaken, dtype: float64
# ProdTaken vs Children
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Children'); #should give a visual frequency graph
stacked_plot(TD['Children'])
ProdTaken     0    1   All
Children
0.0         880  202  1082
1.0        1688  392  2080
2.0        1082  253  1335
3.0         259   66   325
All        3909  913  4822
------------------------------------------------------------------------------------------------------------------------
TD.groupby('Children')['ProdTaken'].value_counts(normalize=True)
Children ProdTaken
0.0 0 0.813309
1 0.186691
1.0 0 0.811538
1 0.188462
2.0 0 0.810487
1 0.189513
3.0 0 0.796923
1 0.203077
Name: ProdTaken, dtype: float64
# ProdTaken vs Income
figure = plt.figure(figsize=(5,5))
x = sns.scatterplot(x="ProdTaken", y="Income", data=TD) # this will give us a visual of the data, and will let us see if the two groups appear similar or not
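A scatter plot of a binary variable against Income tends to overplot, since every point sits on one of two vertical lines. Per-class summary statistics (or a boxplot of Income grouped by ProdTaken) make the comparison crisper. A sketch of the numeric version on invented data:

```python
import pandas as pd

# Invented mini-sample: Income by purchase outcome
toy = pd.DataFrame({'ProdTaken': [0, 0, 0, 1, 1],
                    'Income': [30000, 25000, 35000, 20000, 22000]})

# Median and spread of Income within each class of the target
summary = toy.groupby('ProdTaken')['Income'].agg(['median', 'mean', 'std'])
print(summary)
```

If the two medians differ noticeably on the real data, Income is worth keeping as a predictor even though the scatter looks like two undifferentiated columns of dots.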
Other Numerical vs Numerical Variables of Interest
We would like to check whether there are strong correlations between Age or Income and the following variables: Trips, Passport, Product and City. The last three showed some of the best potential as predictors of whether a customer will make a purchase.
#Age vs Income
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Age", y="Income", data=TD);
sns.boxplot(x=TD['Age'], y=TD['Income'])
plt.show()
#Age vs Trips
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Age", y="Trips", data=TD);
sns.boxplot(x=TD['Age'], y=TD['Trips'])
plt.show()
#Age vs Passport
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Age", y="Passport", data=TD);
#Age vs Product
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Age", y="Product", data=TD);
#Age vs City
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Age", y="City", data=TD);
#Income vs Trips
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Trips", data=TD);
sns.boxplot(x=TD['Trips'], y=TD['Income'])
plt.show()
#Income vs Passport
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Passport", data=TD);
#Income vs Product
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Product", data=TD);
#Income vs City
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="City", data=TD);
Since ProdTaken is the dependent variable (y) we will concentrate on the relationships between ProdTaken and the categorical variables. We will also consider other key relationships.
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x, TD['ProdTaken'], margins=True)
    print(tab1)
    print('-'*120)
    tab = pd.crosstab(x, TD['ProdTaken'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# ProdTaken vs Contact
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Contact'); #should give a visual frequency graph
stacked_plot(TD['Contact'])
ProdTaken           0    1   All
Contact
Company Invited  1109  310  1419
Self Enquiry     2837  607  3444
All              3946  917  4863
------------------------------------------------------------------------------------------------------------------------
# ProdTaken vs Occupation
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Occupation'); #should give a visual frequency graph
stacked_plot(TD['Occupation'])
ProdTaken           0    1   All
Occupation
Free Lancer         0    2     2
Large Business    314  120   434
Salaried         1954  414  2368
Small Business   1700  384  2084
All              3968  920  4888
------------------------------------------------------------------------------------------------------------------------
# ProdTaken vs Gender
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Gender'); #should give a visual frequency graph
stacked_plot(TD['Gender'])
ProdTaken     0    1   All
Gender
Female     1630  342  1972
Male       2338  578  2916
All        3968  920  4888
------------------------------------------------------------------------------------------------------------------------
# ProdTaken vs Product
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Product'); #should give a visual frequency graph
stacked_plot(TD['Product'])
ProdTaken     0    1   All
Product
1          1290  552  1842
2           618  124   742
3          1528  204  1732
4           322   20   342
5           210   20   230
All        3968  920  4888
------------------------------------------------------------------------------------------------------------------------
# ProdTaken vs Designation
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TD, x = 'ProdTaken', hue = 'Designation'); #should give a visual frequency graph
stacked_plot(TD['Designation'])
ProdTaken           0    1   All
Designation
AVP               322   20   342
Executive        1290  552  1842
Manager          1528  204  1732
Senior Manager    618  124   742
VP                210   20   230
All              3968  920  4888
------------------------------------------------------------------------------------------------------------------------
Numerical vs Categorical - Other variables of interest
ax = sns.scatterplot(x="Designation", y="Income", data=TD) # this will give us a visual of the data, and will let us see if the two groups appear similar or not
ax = sns.scatterplot(x="Occupation", y="Income", data=TD) # this will give us a visual of the data, and will let us see if the two groups appear similar or not
# Age vs Marital Status
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Age', hue = 'MaritalStatus'); #should give a visual frequency graph
# Age vs Occupation
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Age', hue = 'Occupation'); #should give a visual frequency graph
# Age vs Designation
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Age', hue = 'Designation'); #should give a visual frequency graph
# Trips vs Marital Status
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Trips', hue = 'MaritalStatus'); #should give a visual frequency graph
# Trips vs Occupation
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Trips', hue = 'Occupation'); #should give a visual frequency graph
# Trips vs Designation
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Trips', hue = 'Designation'); #should give a visual frequency graph
# Duration vs Marital Status
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Duration', hue = 'MaritalStatus'); #should give a visual frequency graph
# Duration vs Occupation
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Duration', hue = 'Occupation'); #should give a visual frequency graph
# Duration vs Designation
figure = plt.figure(figsize=(50,25))
sns.countplot(data = TD, x = 'Duration', hue = 'Designation'); #should give a visual frequency graph
plt.figure(figsize=(15,5))
sns.boxplot(x=TD['Trips'], y=TD['Visitors'], hue=TD['Designation'])
plt.show()
We will look at the categorical variables and how they interact with one another.
#Contact vs Occupation
sns.countplot(data = TD, x = 'Contact', hue = 'Occupation')
<AxesSubplot:xlabel='Contact', ylabel='count'>
#Contact vs Gender
sns.countplot(data = TD, x = 'Contact', hue = 'Gender')
<AxesSubplot:xlabel='Contact', ylabel='count'>
#Contact vs Product
sns.countplot(data = TD, x = 'Contact', hue = 'Product')
<AxesSubplot:xlabel='Contact', ylabel='count'>
#Contact vs Marital Status
sns.countplot(data = TD, x = 'Contact', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Contact', ylabel='count'>
#Contact vs Designation
sns.countplot(data = TD, x = 'Contact', hue = 'Designation')
<AxesSubplot:xlabel='Contact', ylabel='count'>
#Occupation vs Gender
sns.countplot(data = TD, x = 'Occupation', hue = 'Gender')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
#Occupation vs Product
sns.countplot(data = TD, x = 'Occupation', hue = 'Product')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
#Occupation vs Marital Status
sns.countplot(data = TD, x = 'Occupation', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
#Occupation vs Designation
sns.countplot(data = TD, x = 'Occupation', hue = 'Designation')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
#Gender vs Product
sns.countplot(data = TD, x = 'Gender', hue = 'Product')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Gender vs Marital Status
sns.countplot(data = TD, x = 'Gender', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Gender vs Designation
sns.countplot(data = TD, x = 'Gender', hue = 'Designation')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Product vs Marital Status
sns.countplot(data = TD, x = 'Product', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Designation
sns.countplot(data = TD, x = 'Product', hue = 'Designation')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Marital Status vs Designation
sns.countplot(data = TD, x = 'MaritalStatus', hue = 'Designation')
<AxesSubplot:xlabel='MaritalStatus', ylabel='count'>
Let's take one more quick look at the missing values in the dataset...
TD.isnull().sum() # this will tell us if there are missing values in any of the columns
ProdTaken          0
Age              226
Contact           25
City               0
Duration         251
Occupation         0
Gender             0
Visitors           0
Followups         45
Product            0
Stars             26
MaritalStatus      0
Trips            140
Passport           0
Score              0
Car                0
Children          66
Designation        0
Income           233
dtype: int64
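The same summary can be narrowed to just the columns that actually have gaps, with a percentage alongside. A sketch on an invented mini-frame:

```python
import pandas as pd
import numpy as np

# Invented mini-frame with deliberate gaps
toy = pd.DataFrame({'Age': [36.0, np.nan, 29.0, np.nan],
                    'City': [1, 0, -1, 1],
                    'Income': [95000.0, np.nan, 20000.0, 21000.0]})

# Keep only the columns with missing entries, plus the share they represent
missing = toy.isnull().sum()
missing = missing[missing > 0]
pct = (missing / len(toy) * 100).round(1)
print(pd.DataFrame({'missing': missing, 'pct': pct}))
```

Run against TD, this would list exactly the 8 columns discussed below and show that none of them is missing more than about 5% of its values, which makes imputation (rather than wholesale dropping) a reasonable plan.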
So we have 8 columns with missing values to deal with. We have also previously flagged outliers that need attention. Let's deal with the outliers first...
Duration
We know that there were two outliers in Duration: calls that lasted 126 and 127 minutes, respectively. No other call lasted more than 40 minutes, so these are clearly errors. Let's look at the individual rows and check the mean Duration...
TD.loc[TD['Duration'] >= 40]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | NaN | Company Invited | -1 | 126.0 | Salaried | Male | 2 | 3.0 | 1 | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
| 3878 | 0 | 53.0 | Company Invited | -1 | 127.0 | Salaried | Male | 3 | 4.0 | 1 | 3.0 | Married | 4.0 | 0 | 1 | 1 | 2.0 | Executive | 22160.0 |
We know the mean Duration is 15.49 minutes per call (a plain np.median returns NaN here because Duration has missing values). Let's see if grouped means can help us make a good replacement decision...
np.mean(TD[TD.Contact == 'Company Invited'].Duration) # mean Duration for customers who were invited by the company
15.894302848575713
np.mean(TD[TD.Occupation == 'Salaried'].Duration) # mean Duration for salaried customers
15.200176756517896
np.mean(TD[TD.Gender == 'Male'].Duration) # mean Duration for male customers
15.555315315315315
np.mean(TD[TD.Product == 1].Duration) # mean Duration for customers pitched product 1
15.194834362717575
np.mean(TD[TD.Stars == 3].Duration) # mean Duration for customers who rated 3 stars
15.52770914225203
np.mean(TD[TD.MaritalStatus == 'Married'].Duration) # mean Duration for married customers
15.704749103942651
np.mean(TD[TD.Passport == 0].Duration) # mean Duration for customers without a passport
15.309640024405125
np.mean(TD[TD.Score == 1].Duration) # mean Duration for customers with a score of 1
15.315848214285714
np.mean(TD[TD.Car == 1].Duration) # mean Duration for customers who own a car
15.479958173579645
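The nine separate np.mean calls above can be collapsed into one loop over (column, value) pairs. A sketch on a tiny stand-in frame (the column names mirror TD's, but the numbers are invented):

```python
import pandas as pd

# Invented stand-in for TD
toy = pd.DataFrame({'Duration': [10.0, 20.0, 15.0, 15.0],
                    'Contact': ['Company Invited', 'Self Enquiry',
                                'Company Invited', 'Self Enquiry'],
                    'Passport': [0, 0, 1, 0]})

# Mean Duration within each subgroup the outlier rows belong to
subgroups = [('Contact', 'Company Invited'), ('Passport', 0)]
means = {f'{c}=={v}': toy.loc[toy[c] == v, 'Duration'].mean() for c, v in subgroups}
print(means)
```

Listing the (column, value) pairs once keeps the comparison honest: every subgroup mean is computed the same way, and adding another subgroup is a one-line change instead of a new copy-pasted cell.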
We can see that the mean for every subgroup these two rows share rounds to either 15 or 16 (Duration must be an integer). The overall mean rounds down to 15, as do 5 of the 9 subgroup means, so we shall replace the outliers with 15.
TD['Duration'] = TD['Duration'].replace([126],15) #replaces the 126-minute outlier with 15 (rounded mean)
TD['Duration'] = TD['Duration'].replace([127],15) #replaces the 127-minute outlier with 15 (rounded mean)
TD.loc[[1434], :] #double checking one of the rows which used to contain an outlier (Duration should = 15 now)
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | NaN | Company Invited | -1 | 15.0 | Salaried | Male | 2 | 3.0 | 1 | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
Occupation
We know that there were two entries of Free Lancer in the Occupation data. While not outliers, they represent a minuscule portion of the data; two rows cannot support a separate occupation category in the model, so they need to be dropped.
TD.loc[TD['Occupation'] == 'Free Lancer']
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | 1 | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 2446 | 1 | 38.0 | Self Enquiry | 1 | 9.0 | Free Lancer | Male | 4 | 5.0 | 1 | 3.0 | Single | 8.0 | 1 | 3 | 0 | 1.0 | Executive | 20768.0 |
TD.drop([2,2446], axis=0, inplace=True) # this will drop the two rows
TD.reset_index(inplace=True) #prevents there being a gap in the index
TD.drop('index',axis=1,inplace=True)
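A mask-based alternative (a sketch, shown on invented data) avoids hard-coding row labels, and reset_index(drop=True) skips the intermediate 'index' column entirely:

```python
import pandas as pd

# Invented stand-in for TD
toy = pd.DataFrame({'Occupation': ['Salaried', 'Free Lancer', 'Small Business', 'Free Lancer'],
                    'ProdTaken': [0, 1, 0, 1]})

# Keep everything except the rare category; drop=True discards the old index
# instead of saving it as a new 'index' column
toy = toy[toy['Occupation'] != 'Free Lancer'].reset_index(drop=True)
print(toy)
```

The mask form keeps working even if the data is re-exported and the Free Lancer rows land at different positions.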
Trips
We know that there were four outliers in Trips: 19, 20, 21 and 22 trips, respectively. No other customer took more than 8 trips. These values are probably errors, since they would require going on a trip roughly every other week. They also represent such a tiny share of the dataset that replacing them is safe.
TD.loc[TD['Trips'] >= 9]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | 1 | 30.0 | Company Invited | 1 | 10.0 | Large Business | Male | 2 | 3.0 | 1 | 3.0 | Single | 19.0 | 1 | 4 | 1 | 1.0 | Executive | 17285.0 |
| 815 | 0 | 39.0 | Company Invited | 1 | 15.0 | Salaried | Male | 3 | 3.0 | 3 | 4.0 | Unmarried | 21.0 | 0 | 2 | 1 | 0.0 | Manager | 21782.0 |
| 2827 | 1 | 31.0 | Company Invited | 1 | 11.0 | Large Business | Male | 3 | 4.0 | 1 | 3.0 | Single | 20.0 | 1 | 4 | 1 | 2.0 | Executive | 20963.0 |
| 3258 | 0 | 40.0 | Company Invited | 1 | 16.0 | Salaried | Male | 4 | 4.0 | 3 | 4.0 | Unmarried | 22.0 | 0 | 2 | 1 | 1.0 | Manager | 25460.0 |
We know the mean for Trips is 3.24 trips per year (as with Duration, a plain np.median returns NaN because of the missing values). Let's see if grouped means can help us make a good replacement decision...
np.mean(TD[TD.Contact == 'Company Invited'].Trips) # find the mean of all customers' trips who were invited by the company
3.2041998551774076
np.mean(TD[TD.Gender == 'Male'].Trips) # find the mean of all customers' trips who are male
3.229595457771469
np.mean(TD[TD.Car == 1].Trips) # find the mean trips of all customers who own a car
3.2193460490463215
We can see that the mean for every subgroup these customers share rounds to 3 (Trips must be an integer). Let's look at some other variables that at least two of these customers share...
np.mean(TD[TD.Occupation == 'Salaried'].Trips) # find the mean of all customers' trips who are salaried
3.2210617928633596
np.mean(TD[TD.Occupation == 'Large Business'].Trips) # find the mean of all customers' trips who work for a large business
3.4560185185185186
np.mean(TD[TD.Designation == 'Executive'].Trips) # find the mean of all customers' trips who are executives
3.0915531335149864
np.mean(TD[TD.Designation == 'Manager'].Trips) # find the mean of all customers' trips who are managers
3.2820809248554914
np.mean(TD[TD.MaritalStatus == 'Single'].Trips) # find the mean of all customers' trips who are single
2.940248027057497
np.mean(TD[TD.MaritalStatus == 'Unmarried'].Trips) # find the mean trips of all customers who are unmarried
3.365102639296188
In every case the subgroup mean rounds to 3, so it makes the most sense to replace these outlier values with 3.
TD['Trips'] = TD['Trips'].replace([19],3) #replaces the 19-trip outlier with 3 (rounded mean)
TD['Trips'] = TD['Trips'].replace([20],3) #replaces the 20-trip outlier with 3 (rounded mean)
TD['Trips'] = TD['Trips'].replace([21],3) #replaces the 21-trip outlier with 3 (rounded mean)
TD['Trips'] = TD['Trips'].replace([22],3) #replaces the 22-trip outlier with 3 (rounded mean)
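The four replace calls can also be written as one dict-based replace, or as a single mask assignment with the same effect (a sketch on an invented series):

```python
import pandas as pd

# Invented Trips values including the four outliers
trips = pd.Series([2.0, 19.0, 3.0, 21.0, 22.0, 20.0])

# One call covering every outlier value at once
collapsed = trips.replace({19: 3, 20: 3, 21: 3, 22: 3})

# Equivalent mask form: anything above the plausible maximum becomes the rounded mean
masked = trips.copy()
masked[masked > 8] = 3
print(collapsed.tolist(), masked.tolist())
```

The mask form is the more robust of the two: it would also catch an outlier value (say 25) that the hard-coded dict does not anticipate.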
TD.loc[[2827], :] #double checking one of the rows which used to contain an outlier (Trips should = 3 now)
Income
We know that there were two outliers in Income: 95,000 and 98,678, respectively. No other customer earned more than 40,000, and no one else with the same designation or occupation earns anywhere near these amounts, so they are probably errors.
TD.loc[TD['Income'] > 40000]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | 1 | NaN | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 |
| 2480 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | 1 | 5.0 | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 |
np.mean(TD[TD.Designation == 'Executive'].Income) # find the mean of all customers' income who are executives
19941.07586993725
np.mean(TD[TD.Gender == 'Female'].Income) # find the mean of all customers' income who are female
23848.32924226254
np.mean(TD[TD.Occupation == 'Salaried'].Income) # find the mean of all customers' income who are salaried
23649.70260715864
np.mean(TD[TD.Age == 36].Income) # find the mean of all customers' income who are 36
23012.339130434782
np.mean(TD[TD.Age == 37].Income) # find the mean of all customers' income who are 37
23692.18888888889
TD['Income'] = TD['Income'].replace([95000],23620) #replaces 95000 outlier with mean
TD['Income'] = TD['Income'].replace([98678],23620) #replaces 98678 outlier with mean
We also discovered through EDA that there were two outliers on the left side of the boxplot as well, where income was much lower than 10k...
TD.loc[TD['Income'] < 10000]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 141 | 0 | 38.0 | Self Enquiry | 1 | 9.0 | Large Business | Female | 2 | 3.0 | 3 | 3.0 | Single | 4.0 | 1 | 5 | 0 | 0.0 | Manager | 1000.0 |
| 2584 | 0 | 39.0 | Self Enquiry | 1 | 10.0 | Large Business | Female | 3 | 4.0 | 3 | 3.0 | Single | 5.0 | 1 | 5 | 0 | 1.0 | Manager | 4678.0 |
np.mean(TD[TD.Designation == 'Manager'].Income) # find the mean of all customers' income who are managers
22643.15762925599
np.mean(TD[TD.Occupation == 'Large Business'].Income) # find the mean of all customers' income who work for large businesses
22876.969047619048
np.mean(TD[TD.Contact == 'Self Enquiry'].Income) # find the mean of all customers' income who self-enquired
23672.099878934623
TD['Income'] = TD['Income'].replace([1000],23620) # replaces the 1000 outlier with the mean
TD['Income'] = TD['Income'].replace([4678],23620) # replaces the 4678 outlier with the mean
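Rather than hard-coding each outlier value as above, a more general approach is to flag everything outside the Tukey fences (1.5 × IQR) and replace the flagged values with the mean of the remaining points. A minimal sketch on a toy income column (not the project data):

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return a boolean mask marking values outside the 1.5 * IQR Tukey fences."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

incomes = pd.Series([20000.0, 21000.0, 22000.0, 23000.0, 95000.0, 1000.0])
mask = iqr_outliers(incomes)              # flags 95000 (right) and 1000 (left)
incomes[mask] = incomes[~mask].mean()     # replace outliers with the clean mean
```

This catches outliers on both tails in one pass, which matters here since we found extreme values on both sides of the boxplot.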
TD.loc[[38,142], :] # double-checking two rows that used to contain outliers (one left, one right); Income should now equal the mean
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 0 | 33.0 | Company Invited | -1 | 6.0 | Salaried | Female | 2 | 2.0 | 3 | 5.0 | Divorced | 3.0 | 0 | 3 | 1 | 0.0 | Manager | 20376.0 |
| 142 | 0 | 30.0 | Self Enquiry | 1 | 24.0 | Salaried | Female | 3 | 3.0 | 1 | 3.0 | Single | 2.0 | 0 | 2 | 1 | 0.0 | Executive | 17632.0 |
So far we have been cleaning the variables in the order they appear, left to right. From here, however, we will start with the variable that has the fewest missing values and work upward. This makes it easier to spot rows that are missing several values at once; such rows should be dropped...
Contact
Contact has 25 missing values. Let's pull the rows where Contact = NaN and see whether other values are missing in those rows...
TD.loc[TD['Contact'].isnull()] # this will only pull rows where Contact = nan
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 223 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | 3 | 3.0 | Divorced | 1.0 | 0 | 3 | 1 | 0.0 | Manager | NaN |
| 570 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | 1 | 3.0 | Married | 4.0 | 0 | 4 | 1 | 2.0 | Executive | NaN |
| 571 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | 3 | 3.0 | Divorced | 5.0 | 0 | 2 | 1 | 0.0 | Manager | NaN |
| 575 | 0 | 27.0 | NaN | -1 | NaN | Small Business | Male | 2 | 3.0 | 3 | 3.0 | Divorced | 1.0 | 0 | 3 | 0 | 1.0 | Manager | NaN |
| 578 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | 1 | 5.0 | Single | 2.0 | 0 | 2 | 1 | 1.0 | Executive | NaN |
| 597 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | 1 | 3.0 | Single | 7.0 | 0 | 3 | 0 | 0.0 | Executive | NaN |
| 621 | 0 | 32.0 | NaN | -1 | NaN | Salaried | Male | 3 | 3.0 | 3 | 3.0 | Married | 3.0 | 0 | 2 | 0 | 0.0 | Manager | NaN |
| 723 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | 3 | 3.0 | Married | 2.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 842 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | 1 | 3.0 | Divorced | 2.0 | 0 | 5 | 1 | 1.0 | Executive | NaN |
| 1020 | 1 | 25.0 | NaN | -1 | NaN | Salaried | Male | 3 | 4.0 | 1 | 5.0 | Divorced | 4.0 | 0 | 1 | 1 | 0.0 | Executive | NaN |
| 1046 | 0 | 33.0 | NaN | -1 | NaN | Small Business | Male | 2 | 3.0 | 3 | 5.0 | Divorced | 1.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 1142 | 0 | 45.0 | NaN | -1 | NaN | Small Business | Male | 2 | 4.0 | 3 | 5.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 1181 | 0 | 36.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | 3 | 3.0 | Married | 1.0 | 0 | 5 | 1 | 1.0 | Manager | NaN |
| 1216 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Male | 3 | 1.0 | 1 | 3.0 | Married | 2.0 | 0 | 1 | 0 | 0.0 | Executive | NaN |
| 1355 | 0 | 41.0 | NaN | -1 | NaN | Small Business | Female | 2 | 3.0 | 3 | 4.0 | Married | 6.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 1468 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | 3 | 3.0 | Married | 3.0 | 0 | 3 | 0 | 1.0 | Manager | NaN |
| 1693 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | 3 | 3.0 | Married | 1.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 2040 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | 1 | 3.0 | Married | 4.0 | 0 | 4 | 1 | 0.0 | Executive | NaN |
| 2041 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | 3 | 3.0 | Married | 5.0 | 0 | 1 | 0 | 1.0 | Manager | NaN |
| 2045 | 0 | 27.0 | NaN | -1 | NaN | Small Business | Male | 2 | 3.0 | 3 | 3.0 | Married | 1.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 2048 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | 1 | 5.0 | Single | 2.0 | 0 | 1 | 1 | 0.0 | Executive | NaN |
| 2067 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | 1 | 3.0 | Single | 7.0 | 0 | 3 | 1 | 1.0 | Executive | NaN |
| 2091 | 0 | 32.0 | NaN | -1 | NaN | Salaried | Male | 3 | 3.0 | 3 | 3.0 | Married | 3.0 | 0 | 1 | 0 | 2.0 | Manager | NaN |
| 2193 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | 3 | 3.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 2312 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | 1 | 3.0 | Married | 2.0 | 0 | 5 | 1 | 1.0 | Executive | NaN |
Unfortunately, in EVERY row where Contact = NaN, Duration and Income are NaN as well. That's a lot of missing information in these 25 rows. One option is to replace all these values with means or medians; however, doing so makes the data increasingly similar, which can introduce multicollinearity and risks a model that overfits. The alternative is simply to drop all 25 rows, which also removes almost all of the missing Duration values at the same time. Let's check how much of the overall data this represents...
percentage = 25/4888*100
print('25 rows represents about',round(percentage, 2),'%')
25 rows represents about 0.51 %
Half a percent seems a reasonable loss, especially given that each of these rows is missing three columns of data. It makes more sense to drop this data than to clean it, so we will drop it...
TD.dropna(subset = ['Contact'],inplace=True)
TD.reset_index(inplace=True) #prevents there being a gap in the index
TD.drop('index',axis=1,inplace=True)
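The two-step reset-then-drop above works, but pandas can do it in one call: `reset_index(drop=True)` discards the old index instead of saving it as a column. A small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Contact': ['Self Enquiry', None, 'Company Invited'],
                   'Income': [20000.0, None, 25000.0]})

# Drop rows with a missing Contact and renumber the index 0..n-1 in one step
df = df.dropna(subset=['Contact']).reset_index(drop=True)
```

This leaves no gap in the index and no leftover 'index' column to clean up.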
TD.loc[[2313], :]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2313 | 0 | 46.0 | Self Enquiry | 1 | 21.0 | Salaried | Male | 2 | 3.0 | 5 | 4.0 | Married | 6.0 | 0 | 3 | 1 | 1.0 | VP | 34081.0 |
Stars
The Stars variable had 26 missing values. Let's pull the rows where Stars = NaN and see whether other values are missing in those rows...
TD.loc[TD['Stars'].isnull()] # this will only pull rows where Stars = NaN
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | 1 | NaN | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 23620.0 |
| 2582 | 0 | 51.0 | Self Enquiry | 1 | 18.0 | Salaried | Female | 3 | 4.0 | 5 | NaN | Single | 5.0 | 0 | 5 | 1 | 1.0 | VP | 38604.0 |
| 2607 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 5.0 | 5 | NaN | Divorced | 2.0 | 0 | 2 | 1 | 2.0 | VP | 38677.0 |
| 2985 | 1 | 56.0 | Self Enquiry | 1 | 9.0 | Small Business | Male | 4 | 4.0 | 5 | NaN | Divorced | 7.0 | 1 | 2 | 1 | 3.0 | VP | 38537.0 |
| 3163 | 0 | 42.0 | Company Invited | 1 | 14.0 | Salaried | Female | 3 | 6.0 | 5 | NaN | Married | 3.0 | 0 | 4 | 1 | 1.0 | VP | 38651.0 |
| 3166 | 1 | 53.0 | Self Enquiry | -1 | 9.0 | Small Business | Female | 3 | 6.0 | 5 | NaN | Divorced | 3.0 | 0 | 3 | 1 | 1.0 | VP | 38523.0 |
| 3187 | 0 | 47.0 | Self Enquiry | 1 | 7.0 | Small Business | Male | 3 | 4.0 | 5 | NaN | Married | 2.0 | 0 | 5 | 1 | 2.0 | VP | 38305.0 |
| 3268 | 0 | 57.0 | Self Enquiry | 1 | 11.0 | Large Business | Female | 4 | 4.0 | 5 | NaN | Married | 6.0 | 0 | 4 | 0 | 3.0 | VP | 38621.0 |
| 3315 | 0 | 44.0 | Self Enquiry | 1 | 10.0 | Salaried | Male | 4 | 6.0 | 5 | NaN | Divorced | 5.0 | 0 | 5 | 1 | 3.0 | VP | 38418.0 |
| 3335 | 0 | 52.0 | Company Invited | -1 | 16.0 | Salaried | Male | 3 | 4.0 | 5 | NaN | Married | 6.0 | 1 | 4 | 1 | 2.0 | VP | 38525.0 |
| 3373 | 0 | 57.0 | Self Enquiry | 0 | 15.0 | Salaried | Male | 3 | 4.0 | 5 | NaN | Single | 8.0 | 0 | 4 | 1 | 1.0 | VP | 38395.0 |
| 3426 | 0 | 59.0 | Self Enquiry | 1 | 7.0 | Small Business | Female | 4 | 4.0 | 5 | NaN | Divorced | 5.0 | 1 | 1 | 1 | 3.0 | VP | 38379.0 |
| 3571 | 0 | 48.0 | Self Enquiry | 0 | 33.0 | Salaried | Female | 4 | 4.0 | 5 | NaN | Married | 5.0 | 0 | 4 | 1 | 2.0 | VP | 38336.0 |
| 3659 | 0 | 41.0 | Self Enquiry | -1 | 14.0 | Small Business | Male | 3 | 4.0 | 5 | NaN | Single | 3.0 | 0 | 4 | 1 | 2.0 | VP | 38511.0 |
| 3748 | 0 | 49.0 | Self Enquiry | 1 | 17.0 | Salaried | Male | 4 | 5.0 | 5 | NaN | Married | 6.0 | 0 | 3 | 1 | 2.0 | VP | 38343.0 |
| 3818 | 0 | 56.0 | Self Enquiry | 0 | 33.0 | Salaried | Male | 4 | 2.0 | 5 | NaN | Married | 6.0 | 1 | 5 | 1 | 3.0 | VP | 38314.0 |
| 4052 | 0 | 51.0 | Self Enquiry | 1 | 18.0 | Salaried | Female | 3 | 4.0 | 5 | NaN | Single | 5.0 | 0 | 5 | 0 | 2.0 | VP | 38604.0 |
| 4077 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 5.0 | 5 | NaN | Married | 2.0 | 0 | 1 | 1 | 3.0 | VP | 38677.0 |
| 4455 | 1 | 56.0 | Self Enquiry | 1 | 9.0 | Small Business | Male | 4 | 4.0 | 5 | NaN | Married | 7.0 | 1 | 1 | 1 | 1.0 | VP | 38537.0 |
| 4633 | 0 | 42.0 | Company Invited | 1 | 14.0 | Salaried | Female | 3 | 6.0 | 5 | NaN | Married | 3.0 | 0 | 4 | 1 | 2.0 | VP | 38651.0 |
| 4636 | 1 | 53.0 | Self Enquiry | -1 | 9.0 | Small Business | Female | 3 | 6.0 | 5 | NaN | Married | 3.0 | 0 | 3 | 1 | 2.0 | VP | 38523.0 |
| 4657 | 0 | 47.0 | Self Enquiry | 1 | 7.0 | Small Business | Male | 3 | 4.0 | 5 | NaN | Married | 2.0 | 0 | 5 | 1 | 1.0 | VP | 38305.0 |
| 4738 | 0 | 57.0 | Self Enquiry | 1 | 11.0 | Large Business | Female | 4 | 4.0 | 5 | NaN | Married | 6.0 | 0 | 4 | 1 | 2.0 | VP | 38621.0 |
| 4785 | 0 | 44.0 | Self Enquiry | 1 | 10.0 | Salaried | Male | 4 | 6.0 | 5 | NaN | Married | 5.0 | 0 | 5 | 1 | 1.0 | VP | 38418.0 |
| 4805 | 1 | 52.0 | Company Invited | 1 | 35.0 | Salaried | Male | 4 | 5.0 | 3 | NaN | Single | 5.0 | 0 | 3 | 0 | 1.0 | Manager | 38525.0 |
| 4843 | 1 | 57.0 | Self Enquiry | -1 | 23.0 | Salaried | Female | 4 | 4.0 | 2 | NaN | Single | 4.0 | 1 | 5 | 1 | 3.0 | Senior Manager | 38395.0 |
The rows with missing Stars values are otherwise complete. This is good; unlike Contact, we should keep these rows. We can also see a pattern or two: MOST of the customers with missing Stars were Self Enquiry contacts, hold the VP designation, and chose the King package. We can look at the modes for all three subgroups.
TD.Stars.mode() #checking the mode of Stars for all customers
0 3.0 dtype: float64
TD[TD.Contact == 'Self Enquiry'].Stars.mode() #checking the mode of Stars for customers where Contact = Self Enquiry
0 3.0 dtype: float64
TD[TD.Product == 'King'].Stars.mode() #checking the mode of Stars for customers where Product = King
Series([], dtype: float64)
TD[TD.Designation == 'VP'].Stars.mode() #checking the mode of Stars for customers where Designation = VP
0 3.0 dtype: float64
We can see that, for every subgroup we could check, the mode of Stars is 3 (the King filter returned an empty Series because Product is stored numerically, so the string 'King' matches no rows). We can replace all the NaNs in Stars with the mode...
TD['Stars'].fillna(value=3, inplace=True) # changes the NaNs to 3s
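Since SimpleImputer was already imported at the top of the notebook, the same mode fill could be done without hard-coding the value 3 at all — `strategy='most_frequent'` looks up the column mode itself. A sketch on a toy Stars column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

stars = np.array([[3.0], [3.0], [4.0], [np.nan], [5.0]])
imputer = SimpleImputer(strategy='most_frequent')  # fills NaNs with the column mode
filled = imputer.fit_transform(stars)              # the NaN becomes 3.0
```

The fitted imputer can later be applied to unseen data with `.transform()`, which keeps train and test imputation consistent.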
Followups
The Followups variable had 45 missing values. Let's pull the rows where Followups = NaN and see whether other values are missing in those rows...
TD.loc[TD['Followups'].isnull()] # this will only pull rows where Followups = NaN
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 0 | 46.0 | Self Enquiry | 0 | 11.0 | Small Business | Male | 3 | NaN | 3 | 4.0 | Married | 1.0 | 1 | 5 | 0 | 1.0 | Manager | 20021.0 |
| 93 | 0 | 32.0 | Self Enquiry | -1 | 12.0 | Small Business | Male | 2 | NaN | 3 | 3.0 | Single | 2.0 | 0 | 5 | 1 | 0.0 | Manager | 20010.0 |
| 95 | 0 | 24.0 | Self Enquiry | -1 | 9.0 | Salaried | Female | 2 | NaN | 3 | 3.0 | Divorced | 1.0 | 0 | 4 | 1 | 0.0 | Manager | 19999.0 |
| 121 | 1 | 56.0 | Self Enquiry | 1 | 20.0 | Salaried | Female | 3 | NaN | 1 | 4.0 | Divorced | 1.0 | 1 | 5 | 1 | 1.0 | Executive | 19963.0 |
| 134 | 0 | 36.0 | Self Enquiry | 1 | 12.0 | Small Business | Male | 2 | NaN | 3 | 4.0 | Married | 7.0 | 0 | 4 | 1 | 1.0 | Manager | 19941.0 |
| 173 | 0 | 45.0 | Self Enquiry | -1 | 10.0 | Salaried | Female | 1 | NaN | 3 | 5.0 | Married | 5.0 | 1 | 4 | 0 | 0.0 | Manager | 20006.0 |
| 315 | 1 | 52.0 | Self Enquiry | 1 | 14.0 | Small Business | Male | 2 | NaN | 3 | 4.0 | Divorced | 3.0 | 0 | 2 | 1 | 1.0 | Manager | 19941.0 |
| 320 | 0 | 32.0 | Self Enquiry | 1 | 8.0 | Small Business | Female | 3 | NaN | 3 | 3.0 | Single | 1.0 | 0 | 3 | 1 | 2.0 | Manager | 20055.0 |
| 374 | 0 | 51.0 | Self Enquiry | -1 | 20.0 | Salaried | Female | 2 | NaN | 3 | 3.0 | Divorced | 5.0 | 0 | 3 | 0 | 1.0 | Manager | 19936.0 |
| 530 | 0 | 47.0 | Self Enquiry | -1 | 20.0 | Small Business | Male | 2 | NaN | 3 | 5.0 | Single | 3.0 | 0 | 2 | 0 | 1.0 | Manager | 19960.0 |
| 621 | 0 | 28.0 | Self Enquiry | 0 | 14.0 | Small Business | Male | 3 | NaN | 1 | 3.0 | Married | 2.0 | 0 | 2 | 0 | 1.0 | Executive | 19936.0 |
| 728 | 0 | 41.0 | Self Enquiry | 1 | 13.0 | Small Business | Female | 2 | NaN | 3 | 3.0 | Single | 7.0 | 0 | 3 | 1 | 0.0 | Manager | 20003.0 |
| 739 | 1 | 26.0 | Company Invited | -1 | 35.0 | Small Business | Male | 3 | NaN | 3 | 5.0 | Single | 1.0 | 0 | 3 | 0 | 0.0 | Manager | 19969.0 |
| 811 | 0 | 35.0 | Company Invited | -1 | 17.0 | Small Business | Male | 2 | NaN | 3 | 3.0 | Married | 2.0 | 0 | 5 | 0 | 0.0 | Manager | 19968.0 |
| 871 | 0 | 32.0 | Company Invited | 1 | 8.0 | Salaried | Female | 2 | NaN | 3 | 3.0 | Single | 5.0 | 1 | 5 | 1 | 1.0 | Manager | 19998.0 |
| 875 | 0 | 25.0 | Self Enquiry | -1 | 16.0 | Salaried | Male | 2 | NaN | 3 | 3.0 | Single | 1.0 | 0 | 2 | 0 | 1.0 | Manager | 19950.0 |
| 1146 | 0 | 39.0 | Company Invited | 1 | 10.0 | Small Business | Female | 2 | NaN | 3 | 3.0 | Single | 1.0 | 0 | 3 | 1 | 0.0 | Manager | 20042.0 |
| 1201 | 0 | 35.0 | Company Invited | 1 | 8.0 | Small Business | Male | 3 | NaN | 1 | 3.0 | Single | 1.0 | 1 | 3 | 1 | 0.0 | Executive | 19930.0 |
| 1219 | 0 | 47.0 | Company Invited | -1 | 8.0 | Small Business | Male | 2 | NaN | 3 | 4.0 | Married | 1.0 | 0 | 1 | 0 | 1.0 | Manager | 19978.0 |
| 1229 | 0 | 30.0 | Company Invited | 1 | 8.0 | Large Business | Female | 3 | NaN | 1 | 5.0 | Single | 1.0 | 0 | 3 | 0 | 1.0 | Executive | 19968.0 |
| 1337 | 0 | 44.0 | Self Enquiry | 1 | 6.0 | Salaried | Male | 3 | NaN | 3 | 5.0 | Married | 3.0 | 0 | 3 | 1 | 2.0 | Manager | 20033.0 |
| 1373 | 0 | 31.0 | Company Invited | 1 | 6.0 | Salaried | Male | 2 | NaN | 3 | 5.0 | Married | 2.0 | 0 | 1 | 1 | 1.0 | Manager | 20003.0 |
| 1532 | 0 | 46.0 | Self Enquiry | 0 | 11.0 | Small Business | Male | 3 | NaN | 3 | 4.0 | Married | 1.0 | 1 | 5 | 1 | 2.0 | Manager | 20021.0 |
| 1547 | 0 | 32.0 | Self Enquiry | -1 | 12.0 | Small Business | Male | 2 | NaN | 3 | 3.0 | Single | 2.0 | 0 | 5 | 1 | 1.0 | Manager | 20010.0 |
| 1549 | 0 | 24.0 | Self Enquiry | -1 | 9.0 | Salaried | Female | 2 | NaN | 3 | 3.0 | Married | 1.0 | 0 | 4 | 0 | 0.0 | Manager | 19999.0 |
| 1575 | 1 | 56.0 | Self Enquiry | 1 | 20.0 | Salaried | Female | 3 | NaN | 1 | 4.0 | Married | 1.0 | 1 | 5 | 1 | 0.0 | Executive | 19963.0 |
| 1588 | 0 | 36.0 | Self Enquiry | 1 | 12.0 | Small Business | Male | 2 | NaN | 3 | 4.0 | Married | 7.0 | 0 | 4 | 0 | 1.0 | Manager | 19941.0 |
| 1627 | 0 | 45.0 | Self Enquiry | -1 | 10.0 | Salaried | Female | 1 | NaN | 3 | 5.0 | Married | 5.0 | 1 | 4 | 0 | 0.0 | Manager | 20006.0 |
| 1769 | 1 | 52.0 | Self Enquiry | 1 | 14.0 | Small Business | Male | 2 | NaN | 3 | 4.0 | Married | 3.0 | 0 | 1 | 1 | 0.0 | Manager | 19941.0 |
| 1828 | 0 | 51.0 | Self Enquiry | -1 | 20.0 | Salaried | Female | 2 | NaN | 3 | 3.0 | Married | 5.0 | 0 | 3 | 1 | 0.0 | Manager | 19936.0 |
| 1984 | 0 | 47.0 | Self Enquiry | -1 | 20.0 | Small Business | Male | 2 | NaN | 3 | 5.0 | Single | 3.0 | 0 | 1 | 1 | 1.0 | Manager | 19960.0 |
| 2075 | 0 | 28.0 | Self Enquiry | 0 | 14.0 | Small Business | Male | 3 | NaN | 1 | 3.0 | Married | 2.0 | 0 | 1 | 1 | 0.0 | Executive | 19936.0 |
| 2105 | 0 | 28.0 | Self Enquiry | -1 | 11.0 | Small Business | Male | 3 | NaN | 3 | 3.0 | Single | 2.0 | 0 | 3 | 0 | 1.0 | Manager | 19908.0 |
| 2182 | 0 | 41.0 | Self Enquiry | 1 | 13.0 | Small Business | Female | 2 | NaN | 3 | 3.0 | Single | 7.0 | 0 | 3 | 1 | 1.0 | Manager | 20003.0 |
| 2193 | 1 | 26.0 | Company Invited | -1 | 35.0 | Small Business | Male | 3 | NaN | 3 | 5.0 | Single | 1.0 | 0 | 3 | 1 | 1.0 | Manager | 19969.0 |
| 2265 | 0 | 35.0 | Company Invited | -1 | 17.0 | Small Business | Male | 2 | NaN | 3 | 3.0 | Married | 2.0 | 0 | 5 | 1 | 1.0 | Manager | 19968.0 |
| 2325 | 0 | 32.0 | Company Invited | 1 | 8.0 | Salaried | Female | 2 | NaN | 3 | 3.0 | Single | 5.0 | 1 | 5 | 1 | 0.0 | Manager | 19998.0 |
| 2329 | 0 | 25.0 | Self Enquiry | -1 | 16.0 | Salaried | Male | 2 | NaN | 3 | 3.0 | Single | 1.0 | 0 | 1 | 1 | 0.0 | Manager | 19950.0 |
| 2440 | 0 | 22.0 | Self Enquiry | 1 | 22.0 | Salaried | Male | 4 | NaN | 1 | 3.0 | Single | 3.0 | 0 | 3 | 1 | 2.0 | Executive | 19910.0 |
| 2932 | 0 | 36.0 | Company Invited | 1 | 10.0 | Salaried | Male | 3 | NaN | 1 | 3.0 | Divorced | 3.0 | 0 | 2 | 1 | 2.0 | Executive | 19959.0 |
| 3429 | 1 | 32.0 | Company Invited | -1 | 7.0 | Salaried | Female | 3 | NaN | 1 | 3.0 | Single | 3.0 | 0 | 3 | 1 | 2.0 | Executive | 20037.0 |
| 3433 | 1 | 32.0 | Self Enquiry | 1 | 15.0 | Salaried | Female | 3 | NaN | 1 | 4.0 | Single | 3.0 | 0 | 4 | 0 | 2.0 | Executive | 19939.0 |
| 3469 | 0 | 31.0 | Company Invited | 1 | 14.0 | Large Business | Male | 4 | NaN | 1 | 3.0 | Married | 3.0 | 0 | 5 | 0 | 1.0 | Executive | 19952.0 |
| 3910 | 0 | 22.0 | Self Enquiry | 1 | 22.0 | Salaried | Male | 4 | NaN | 1 | 3.0 | Single | 3.0 | 0 | 5 | 1 | 2.0 | Executive | 19910.0 |
| 4402 | 0 | 36.0 | Company Invited | 1 | 10.0 | Salaried | Male | 3 | NaN | 1 | 3.0 | Married | 3.0 | 0 | 1 | 1 | 2.0 | Executive | 19959.0 |
The other values in the rows where Followups = NaN do not have much in common. There seems to be an even divide between Salaried and Small Business, as well as an even divide between Manager and Executive. We can check the modes within these four subgroups to pick an appropriate replacement value.
TD.Followups.mode()
0 4.0 dtype: float64
TD[TD.Designation == 'Manager'].Followups.mode() #checking the mode of Followups for customers where Designation = Manager
0 4.0 dtype: float64
TD[TD.Designation == 'Executive'].Followups.mode() #checking the mode of Followups for customers where Designation = Executive
0 4.0 dtype: float64
TD[TD.Occupation == 'Salaried'].Followups.mode() #checking the mode of Followups for customers where Occupation = Salaried
0 4.0 dtype: float64
TD[TD.Occupation == 'Small Business'].Followups.mode() #checking the mode of Followups for customers where Occupation = Small Business
0 4.0 dtype: float64
We can see that, for all the subgroups, the mode of Followups is 4. We can replace all the NaNs in Followups with the mode...
TD['Followups'].fillna(value=4, inplace=True) # changes the NaNs to 4s
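The four mode checks above each took a separate notebook cell; they can be condensed into one dict comprehension over (column, value) pairs. A self-contained sketch on a toy frame (the real checks would run against TD):

```python
import pandas as pd

df = pd.DataFrame({
    'Occupation':  ['Salaried', 'Salaried', 'Small Business', 'Small Business', 'Small Business'],
    'Designation': ['Manager', 'Executive', 'Manager', 'Executive', 'Manager'],
    'Followups':   [4.0, 4.0, 3.0, 4.0, 4.0],
})

# One dict comprehension instead of one notebook cell per subgroup check
subgroups = [('Occupation', 'Salaried'), ('Occupation', 'Small Business'),
             ('Designation', 'Manager'), ('Designation', 'Executive')]
modes = {(col, val): df.loc[df[col] == val, 'Followups'].mode().iat[0]
         for col, val in subgroups}
```

Printing `modes` shows every subgroup's mode at a glance, making agreement (or disagreement) between subgroups obvious.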
Children
The Children variable had 66 missing values. Let's pull the rows where Children = NaN and see whether other values are missing in those rows...
TD.loc[TD['Children'].isnull()] # this will only pull rows where Children = NaN
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 0 | 50.0 | Self Enquiry | 1 | 17.0 | Salaried | Female | 2 | 3.0 | 5 | 4.0 | Single | 4.0 | 0 | 5 | 1 | NaN | VP | 34926.0 |
| 189 | 0 | 52.0 | Self Enquiry | 1 | 6.0 | Salaried | Male | 3 | 4.0 | 5 | 3.0 | Divorced | 1.0 | 0 | 2 | 1 | NaN | VP | 34999.0 |
| 566 | 1 | 55.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3 | 3.0 | 5 | 5.0 | Divorced | 6.0 | 1 | 2 | 1 | NaN | VP | 34859.0 |
| 737 | 0 | 41.0 | Company Invited | 1 | 13.0 | Salaried | Female | 2 | 5.0 | 5 | 3.0 | Married | 2.0 | 0 | 4 | 1 | NaN | VP | 34973.0 |
| 740 | 1 | 52.0 | Self Enquiry | -1 | 8.0 | Small Business | Female | 2 | 5.0 | 5 | 3.0 | Divorced | 2.0 | 0 | 3 | 1 | NaN | VP | 34845.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4448 | 0 | 53.0 | Self Enquiry | -1 | 10.0 | Small Business | Male | 3 | 5.0 | 4 | 5.0 | Married | 3.0 | 0 | 5 | 1 | NaN | AVP | 35534.0 |
| 4479 | 0 | 48.0 | Self Enquiry | 1 | 9.0 | Salaried | Female | 3 | 4.0 | 4 | 3.0 | Married | 3.0 | 1 | 4 | 1 | NaN | AVP | 35430.0 |
| 4503 | 0 | 52.0 | Self Enquiry | -1 | 33.0 | Small Business | Female | 4 | 4.0 | 4 | 3.0 | Married | 4.0 | 0 | 3 | 1 | NaN | AVP | 34985.0 |
| 4661 | 0 | 56.0 | Company Invited | 1 | 9.0 | Small Business | Male | 3 | 5.0 | 4 | 5.0 | Single | 2.0 | 0 | 3 | 1 | NaN | AVP | 35434.0 |
| 4792 | 1 | 30.0 | Self Enquiry | 1 | 14.0 | Large Business | Female | 3 | 4.0 | 1 | 3.0 | Married | 5.0 | 1 | 4 | 1 | NaN | Executive | 34802.0 |
66 rows × 19 columns
There are too many rows to see all the data at once, but one thing that stands out is that the incomes are tightly clustered, apparently staying between 34k and 36k. Let's verify this...
Children = TD.loc[TD['Children'].isnull()].copy() # create a dataframe of all rows where Children = NaN
Children.describe()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 0.0 | 66.000000 |
| mean | 0.106061 | 47.848485 | 0.318182 | 15.151515 | 3.075758 | 3.939394 | 4.227273 | 3.787879 | 4.030303 | 0.272727 | 3.363636 | 0.742424 | NaN | 35052.378788 |
| std | 0.310275 | 6.926521 | 0.914147 | 9.047299 | 0.639966 | 0.909417 | 0.697147 | 0.832330 | 1.607384 | 0.448775 | 1.308391 | 0.440650 | NaN | 303.626023 |
| min | 0.000000 | 30.000000 | -1.000000 | 6.000000 | 2.000000 | 1.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | NaN | 34636.000000 |
| 25% | 0.000000 | 43.000000 | -1.000000 | 9.000000 | 3.000000 | 3.250000 | 4.000000 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 0.250000 | NaN | 34802.000000 |
| 50% | 0.000000 | 48.000000 | 1.000000 | 10.500000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 0.000000 | 3.000000 | 1.000000 | NaN | 34979.000000 |
| 75% | 0.000000 | 53.000000 | 1.000000 | 17.750000 | 3.000000 | 4.750000 | 5.000000 | 4.750000 | 5.000000 | 1.000000 | 4.000000 | 1.000000 | NaN | 35329.000000 |
| max | 1.000000 | 58.000000 | 1.000000 | 35.000000 | 4.000000 | 6.000000 | 5.000000 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | NaN | 35558.000000 |
As we thought, Income in these rows ranges from 34636 to 35558. It may be better to use the mode of Children for customers in this income range...
TD.Children.mode() # mode of all children
0 1.0 dtype: float64
TD.loc[(TD['Income'] >= 34636) & (TD['Income'] <= 35558)] # all customers within the income range of the missing-Children rows
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 0 | 50.0 | Self Enquiry | 1 | 17.0 | Salaried | Female | 2 | 3.0 | 5 | 4.0 | Single | 4.0 | 0 | 5 | 1 | NaN | VP | 34926.0 |
| 189 | 0 | 52.0 | Self Enquiry | 1 | 6.0 | Salaried | Male | 3 | 4.0 | 5 | 3.0 | Divorced | 1.0 | 0 | 2 | 1 | NaN | VP | 34999.0 |
| 566 | 1 | 55.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3 | 3.0 | 5 | 5.0 | Divorced | 6.0 | 1 | 2 | 1 | NaN | VP | 34859.0 |
| 737 | 0 | 41.0 | Company Invited | 1 | 13.0 | Salaried | Female | 2 | 5.0 | 5 | 3.0 | Married | 2.0 | 0 | 4 | 1 | NaN | VP | 34973.0 |
| 740 | 1 | 52.0 | Self Enquiry | -1 | 8.0 | Small Business | Female | 2 | 5.0 | 5 | 3.0 | Divorced | 2.0 | 0 | 3 | 1 | NaN | VP | 34845.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4479 | 0 | 48.0 | Self Enquiry | 1 | 9.0 | Salaried | Female | 3 | 4.0 | 4 | 3.0 | Married | 3.0 | 1 | 4 | 1 | NaN | AVP | 35430.0 |
| 4503 | 0 | 52.0 | Self Enquiry | -1 | 33.0 | Small Business | Female | 4 | 4.0 | 4 | 3.0 | Married | 4.0 | 0 | 3 | 1 | NaN | AVP | 34985.0 |
| 4661 | 0 | 56.0 | Company Invited | 1 | 9.0 | Small Business | Male | 3 | 5.0 | 4 | 5.0 | Single | 2.0 | 0 | 3 | 1 | NaN | AVP | 35434.0 |
| 4792 | 1 | 30.0 | Self Enquiry | 1 | 14.0 | Large Business | Female | 3 | 4.0 | 1 | 3.0 | Married | 5.0 | 1 | 4 | 1 | NaN | Executive | 34802.0 |
| 4832 | 1 | 51.0 | Company Invited | -1 | 19.0 | Salaried | Male | 3 | 4.0 | 3 | 3.0 | Single | 5.0 | 1 | 3 | 1 | 2.0 | Manager | 35558.0 |
68 rows × 19 columns
Unfortunately, there are only 68 customers in this income range, and we know that in 66 of them Children = NaN. This forces us to abandon the idea of a range-specific mode and fall back on the column mode.
TD['Children'].fillna(value=1, inplace=True) # changes the NaNs to 1s
Trips
The Trips variable had 140 missing values. That is too many rows to examine visually, so let's create a dataframe, as we did with Children, to look for a pattern...
Trips = TD.loc[TD['Trips'].isnull()].copy() # create a dataframe of all rows where Trips = NaN
Trips.describe()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 140.000000 | 140.000000 | 140.000000 | 140.000000 | 140.000000 | 140.000000 | 140.000000 | 140.000000 | 0.0 | 140.000000 | 140.00000 | 140.000000 | 140.000000 | 140.000000 |
| mean | 0.085714 | 47.878571 | 0.492857 | 13.664286 | 3.364286 | 4.128571 | 4.450000 | 3.557143 | NaN | 0.271429 | 3.15000 | 0.685714 | 1.778571 | 37062.464286 |
| std | 0.280947 | 5.535325 | 0.826681 | 7.252497 | 0.539260 | 0.708051 | 0.867685 | 0.824721 | NaN | 0.446293 | 1.45891 | 0.465898 | 0.700132 | 833.351079 |
| min | 0.000000 | 34.000000 | -1.000000 | 6.000000 | 2.000000 | 2.000000 | 1.000000 | 3.000000 | NaN | 0.000000 | 1.00000 | 0.000000 | 1.000000 | 35563.000000 |
| 25% | 0.000000 | 43.000000 | 0.000000 | 9.000000 | 3.000000 | 4.000000 | 4.000000 | 3.000000 | NaN | 0.000000 | 2.00000 | 0.000000 | 1.000000 | 36317.000000 |
| 50% | 0.000000 | 47.000000 | 1.000000 | 11.000000 | 3.000000 | 4.000000 | 5.000000 | 3.000000 | NaN | 0.000000 | 3.00000 | 1.000000 | 2.000000 | 37284.000000 |
| 75% | 0.000000 | 51.000000 | 1.000000 | 17.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | NaN | 1.000000 | 5.00000 | 1.000000 | 2.000000 | 37845.500000 |
| max | 1.000000 | 60.000000 | 1.000000 | 36.000000 | 4.000000 | 6.000000 | 5.000000 | 5.000000 | NaN | 1.000000 | 5.00000 | 1.000000 | 3.000000 | 38215.000000 |
Trips.sample(10)
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4586 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 4.0 | 4 | 3.0 | Married | NaN | 0 | 1 | 1 | 3.0 | AVP | 35777.0 |
| 4131 | 0 | 46.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 2.0 | 5 | 3.0 | Married | NaN | 0 | 4 | 1 | 2.0 | VP | 37880.0 |
| 3552 | 0 | 47.0 | Self Enquiry | -1 | 7.0 | Salaried | Male | 3 | 2.0 | 4 | 5.0 | Single | NaN | 0 | 4 | 1 | 2.0 | AVP | 36245.0 |
| 3949 | 0 | 51.0 | Self Enquiry | 1 | 14.0 | Small Business | Female | 3 | 5.0 | 5 | 3.0 | Married | NaN | 1 | 4 | 1 | 2.0 | VP | 37418.0 |
| 3441 | 0 | 48.0 | Self Enquiry | 1 | 9.0 | Small Business | Female | 3 | 4.0 | 4 | 3.0 | Divorced | NaN | 0 | 5 | 1 | 1.0 | AVP | 35847.0 |
| 3694 | 0 | 47.0 | Self Enquiry | 1 | 9.0 | Small Business | Male | 3 | 4.0 | 5 | 3.0 | Married | NaN | 1 | 1 | 1 | 2.0 | VP | 38006.0 |
| 4686 | 0 | 43.0 | Self Enquiry | 1 | 9.0 | Small Business | Male | 3 | 4.0 | 4 | 5.0 | Married | NaN | 0 | 3 | 1 | 1.0 | AVP | 36343.0 |
| 4748 | 0 | 47.0 | Self Enquiry | -1 | 9.0 | Small Business | Female | 4 | 4.0 | 4 | 5.0 | Married | NaN | 0 | 1 | 1 | 2.0 | AVP | 35726.0 |
| 3372 | 0 | 56.0 | Self Enquiry | 1 | 9.0 | Small Business | Female | 3 | 6.0 | 5 | 3.0 | Divorced | NaN | 0 | 4 | 1 | 2.0 | VP | 37865.0 |
| 4124 | 0 | 53.0 | Company Invited | -1 | 9.0 | Salaried | Male | 4 | 5.0 | 5 | 3.0 | Single | NaN | 1 | 1 | 1 | 1.0 | VP | 37746.0 |
From the sample, ProdTaken = 0 in every case, Self Enquiry dominates Contact, Salaried and Small Business dominate Occupation, most rows have 3 or 4 visitors, Product is mostly 4 or 5, and VP or AVP is the most common Designation. The income range is tight, as we saw from .describe, running roughly from 35.5k to 38k. Let's look at some value counts to select a good group representative...
TD.Trips.mode() # mode of all trips
0 2.0 dtype: float64
Trips["ProdTaken"].value_counts()
0    128
1     12
Name: ProdTaken, dtype: int64
ProdTaken is probably not a good grouping indicator: 1s make up just under 19% of the full dataset but under 9% of these rows.
Trips["Occupation"].value_counts()
Salaried          70
Small Business    68
Large Business     2
Name: Occupation, dtype: int64
TD[TD.Occupation == 'Salaried'].Trips.mode() #checking the mode of Trips for customers where Occupation = Salaried
0 2.0 dtype: float64
TD[TD.Occupation == 'Small Business'].Trips.mode() #checking the mode of Trips for customers where Occupation = Small Business
0 2.0 dtype: float64
Trips["Visitors"].value_counts()
3    81
4    55
2     4
Name: Visitors, dtype: int64
TD[TD.Visitors == 3].Trips.mode() #checking the mode of Trips for customers where Visitors = 3
0 2.0 dtype: float64
TD[TD.Visitors == 4].Trips.mode() #checking the mode of Trips for customers where Visitors = 4
0 3.0 dtype: float64
Trips["Product"].value_counts()
5    82
4    50
1     5
3     2
2     1
Name: Product, dtype: int64
TD[TD.Product == 5].Trips.mode() #checking the mode of Trips for customers where Product = 5
0 2.0 dtype: float64
TD[TD.Product == 4].Trips.mode() #checking the mode of Trips for customers where Product = 4
0 2.0 dtype: float64
Trips["Designation"].value_counts()
VP                82
AVP               50
Executive          5
Manager            2
Senior Manager     1
Name: Designation, dtype: int64
TD[TD.Designation == 'VP'].Trips.mode() #checking the mode of Trips for customers where designation = VP
0 2.0 dtype: float64
TD[TD.Designation == 'AVP'].Trips.mode() #checking the mode of Trips for customers where designation = AVP
0 2.0 dtype: float64
We can see that, for almost all the subgroups, the mode of Trips is 2 (the one exception is Visitors = 4, where it is 3). We can replace all the NaNs in Trips with the mode...
TD['Trips'].fillna(value=2, inplace=True) # changes the NaNs to 2s
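The same fill pattern has now been applied to Stars, Followups, Children and Trips. A small helper (a sketch, not part of the original notebook) makes the pattern reusable and avoids hard-coding each mode:

```python
import pandas as pd

def fill_with_mode(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Fill NaNs in each listed column with that column's mode."""
    for col in columns:
        df[col] = df[col].fillna(df[col].mode().iat[0])
    return df

df = pd.DataFrame({'Stars': [3.0, None, 3.0], 'Trips': [2.0, 2.0, None]})
df = fill_with_mode(df, ['Stars', 'Trips'])
```

One caveat: this uses the overall column mode, so it is only a shortcut where the subgroup checks (as above) agree with the column-wide mode.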
Designation
We will assume the Designation values are meant to be ranked. In any organization there is a hierarchy of positions: the higher up, the more responsibilities and, generally, the higher the salary. The shape of an organization's hierarchy is usually a triangle, with the fewest people at the top holding most of the decision-making power. Based on frequencies, we already know the ranking SHOULD run (lowest to highest): Executive, Manager, Senior Manager, AVP, VP. Let's look at the mean salaries of these positions before making a final decision...
TD.groupby('Designation')['Income'].mean()
Designation AVP 32241.046784 Executive 19857.540217 Manager 22669.363178 Senior Manager 26637.749326 VP 35878.947826 Name: Income, dtype: float64
We can see that if we choose to rank the designations by salary, we would get the same result. Therefore, we can go ahead and rank our column.
TD['Designation'] = TD['Designation'].map({'Executive': 0, 'Manager': 1, 'Senior Manager': 2, 'AVP': 3, 'VP': 4})
#Changes the categorical designation values to correctly ranked labels
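As an aside, the same ranking can be expressed with pandas' ordered `Categorical` type, which makes the hierarchy explicit before taking integer codes. A minimal sketch on toy data (not the notebook's TD frame):

```python
import pandas as pd

# Encode the hierarchy as an ordered Categorical, then take the
# integer codes (Executive=0 ... VP=4).
order = ['Executive', 'Manager', 'Senior Manager', 'AVP', 'VP']
s = pd.Series(['VP', 'Executive', 'AVP'])
ranked = pd.Categorical(s, categories=order, ordered=True).codes
print(ranked.tolist())  # [4, 0, 3]
```

This keeps the ordering attached to the data type itself, at the cost of an extra conversion step compared with a plain `.map`.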
Income
The Income variable had 208 missing values. We know from the EDA analysis that Income is moderately correlated with Age and weakly correlated with Trips. We did not see much from the categorical variables to make a link between Income and any of them.
np.median(TD.Income)
nan
np.mean(TD.Income)
23599.330324521812
We know that the overall median is NaN and that the data is slightly skewed. When data is skewed, we should replace NaN values with the median, but that obviously will not work here directly, since the median itself is NaN. One way to overcome this is to compute the median of Income by Age, since we know the two are correlated. This does risk making Income and Age even more correlated, but since the 208 values we would replace with the group medians amount to less than 5% of all Income data, it should not impact the model too significantly.
median = TD.groupby('Age')['Income'].median()
mean = TD.groupby('Age')['Income'].mean()
IncomebyAge = pd.concat([median, mean],axis=1,sort=False)
IncomebyAge.columns=['Median','Mean']
IncomebyAge['Difference'] = IncomebyAge.Median - IncomebyAge.Mean
print('\033[1m' + 'Income Means and Medians by Age')
IncomebyAge
Income Means and Medians by Age
| Median | Mean | Difference | |
|---|---|---|---|
| Age | |||
| 18.0 | 16541.5 | 16515.071429 | 26.428571 |
| 19.0 | 17779.0 | 18439.156250 | -660.156250 |
| 20.0 | 18033.0 | 19050.236842 | -1017.236842 |
| 21.0 | 18230.0 | 19234.707317 | -1004.707317 |
| 22.0 | 19842.5 | 19457.217391 | 385.282609 |
| 23.0 | 21006.0 | 19943.804348 | 1062.195652 |
| 24.0 | 19577.0 | 19259.666667 | 317.333333 |
| 25.0 | 21078.0 | 20825.281690 | 252.718310 |
| 26.0 | 20623.0 | 20462.469388 | 160.530612 |
| 27.0 | 21178.0 | 20863.084615 | 314.915385 |
| 28.0 | 21038.0 | 21126.847222 | -88.847222 |
| 29.0 | 21070.5 | 20977.186047 | 93.313953 |
| 30.0 | 21378.0 | 21777.733668 | -399.733668 |
| 31.0 | 21812.0 | 22251.128205 | -439.128205 |
| 32.0 | 21589.0 | 22427.062176 | -838.062176 |
| 33.0 | 22104.0 | 22390.324324 | -286.324324 |
| 34.0 | 21477.0 | 22250.279412 | -773.279412 |
| 35.0 | 22664.0 | 22486.766520 | 177.233480 |
| 36.0 | 22504.0 | 22701.991304 | -197.991304 |
| 37.0 | 23317.0 | 23275.200000 | 41.800000 |
| 38.0 | 22614.0 | 22901.676301 | -287.676301 |
| 39.0 | 24014.0 | 24370.560000 | -356.560000 |
| 40.0 | 24094.0 | 25269.287770 | -1175.287770 |
| 41.0 | 23772.0 | 25569.110390 | -1797.110390 |
| 42.0 | 23444.0 | 25429.649635 | -1985.649635 |
| 43.0 | 23871.0 | 25537.792308 | -1666.792308 |
| 44.0 | 23443.0 | 24893.676190 | -1450.676190 |
| 45.0 | 23219.0 | 24802.752294 | -1583.752294 |
| 46.0 | 24619.0 | 26654.528926 | -2035.528926 |
| 47.0 | 27842.5 | 27871.568182 | -29.068182 |
| 48.0 | 23748.0 | 26324.269841 | -2576.269841 |
| 49.0 | 25965.0 | 26869.093750 | -904.093750 |
| 50.0 | 29527.0 | 28653.534884 | 873.465116 |
| 51.0 | 28438.0 | 28632.511111 | -194.511111 |
| 52.0 | 27204.5 | 27498.500000 | -294.000000 |
| 53.0 | 28544.0 | 28389.000000 | 155.000000 |
| 54.0 | 28986.5 | 28636.333333 | 350.166667 |
| 55.0 | 29417.0 | 29271.625000 | 145.375000 |
| 56.0 | 29654.0 | 29407.696429 | 246.303571 |
| 57.0 | 24439.0 | 26907.379310 | -2468.379310 |
| 58.0 | 25312.0 | 24928.774194 | 383.225806 |
| 59.0 | 24338.0 | 25580.818182 | -1242.818182 |
| 60.0 | 25266.0 | 26245.482759 | -979.482759 |
| 61.0 | 28944.0 | 28877.666667 | 66.333333 |
The table highlights the fact that the data is slightly skewed, since mean = median under a normal distribution. We also see that at no age is the median NaN. Therefore, we should be able to replace the NaNs with the median of Income grouped by Age.
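The grouped-median fill used below can be sketched on toy data. This is not the notebook's TD frame, just a minimal demonstration of `fillna` combined with `groupby(...).transform('median')`:

```python
import numpy as np
import pandas as pd

# Toy frame: fill missing Income with the median Income of the same Age group.
df = pd.DataFrame({'Age':    [30, 30, 30, 40, 40],
                   'Income': [20000.0, 22000.0, np.nan, 30000.0, np.nan]})
# transform('median') broadcasts each group's median back to every row,
# so fillna only touches the rows where Income is NaN.
df['Income'] = df['Income'].fillna(df.groupby('Age')['Income'].transform('median'))
print(df['Income'].tolist())  # [20000.0, 22000.0, 21000.0, 30000.0, 30000.0]
```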
Note
The following code was run to randomly select a row where Income = NaN. The aim is to check the age, replace the missing Income with the median by Age, and verify below that it worked. However, restarting the kernel would select a different row each time, so after running the code once (it returned row 452) we hard-code that row and comment out the sampling line.
#TD.loc[TD['Income'].isnull()].sample(1) # this randomly choose one of the rows where Income = NaN
TD.loc[[452], :]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 452 | 0 | 29.0 | Self Enquiry | 1 | NaN | Large Business | Male | 2 | 3.0 | 3 | 3.0 | Married | 5.0 | 0 | 5 | 0 | 0.0 | 1 | NaN |
TD['Income'] = TD['Income'].fillna(TD.groupby('Age')['Income'].transform('median'))
TD.loc[[452], :]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 452 | 0 | 29.0 | Self Enquiry | 1 | NaN | Large Business | Male | 2 | 3.0 | 3 | 3.0 | Married | 5.0 | 0 | 5 | 0 | 0.0 | 1 | 21070.5 |
We can see from our above table that the replacement function worked. However, there may be instances where Age AND Income are NaN...
TD.loc[(TD['Income'].isnull()) & (TD['Age'].isnull())]
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0 | NaN | Self Enquiry | 1 | 21.0 | Salaried | Female | 2 | 4.0 | 3 | 3.0 | Single | 1.0 | 1 | 3 | 0 | 0.0 | 1 | NaN |
| 18 | 0 | NaN | Self Enquiry | 1 | 8.0 | Salaried | Male | 2 | 3.0 | 1 | 3.0 | Single | 6.0 | 1 | 4 | 0 | 1.0 | 0 | NaN |
| 19 | 0 | NaN | Company Invited | 1 | 17.0 | Salaried | Female | 3 | 2.0 | 3 | 3.0 | Married | 1.0 | 0 | 3 | 1 | 2.0 | 1 | NaN |
| 25 | 1 | NaN | Company Invited | 1 | 22.0 | Salaried | Female | 3 | 5.0 | 1 | 5.0 | Single | 2.0 | 1 | 4 | 1 | 2.0 | 0 | NaN |
| 43 | 0 | NaN | Company Invited | 1 | 6.0 | Small Business | Female | 2 | 3.0 | 3 | 3.0 | Single | 2.0 | 0 | 3 | 1 | 0.0 | 1 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2309 | 0 | NaN | Self Enquiry | -1 | 35.0 | Salaried | Male | 3 | 3.0 | 3 | 5.0 | Married | 1.0 | 0 | 1 | 1 | 0.0 | 1 | NaN |
| 2337 | 0 | NaN | Self Enquiry | 1 | 7.0 | Salaried | Female | 3 | 3.0 | 1 | 3.0 | Married | 2.0 | 0 | 1 | 1 | 2.0 | 0 | NaN |
| 2373 | 1 | NaN | Company Invited | -1 | 19.0 | Large Business | Female | 2 | 3.0 | 3 | 4.0 | Single | 6.0 | 0 | 3 | 1 | 0.0 | 1 | NaN |
| 2384 | 1 | NaN | Self Enquiry | 1 | 30.0 | Small Business | Male | 2 | 3.0 | 1 | 4.0 | Married | 2.0 | 1 | 1 | 0 | 0.0 | 0 | NaN |
| 2404 | 1 | NaN | Self Enquiry | 1 | 14.0 | Small Business | Female | 3 | 3.0 | 1 | 5.0 | Married | 2.0 | 1 | 3 | 0 | 2.0 | 0 | NaN |
130 rows × 19 columns
Unfortunately, there are 130 instances where Age and Income are both NaN. Let's skip ahead to Age, clean the Age variable, and then come back to replace the remaining missing Income values with the median grouped by Age.
Age
The Age variable had 226 missing values. We know from the EDA analysis that Age is moderately correlated with Income and weakly correlated with Trips. Age also had a relationship with Designation: as individuals aged, they tended to hold more senior positions (AVP and VP).
TD.groupby('Designation')['Age'].median()
Designation 0 32.0 1 36.0 2 38.0 3 49.0 4 49.0 Name: Age, dtype: float64
TD.groupby('Trips')['Age'].median()
Trips 1.0 36.0 2.0 35.0 3.0 34.0 4.0 41.0 5.0 41.0 6.0 39.0 7.0 38.0 8.0 38.0 Name: Age, dtype: float64
It appears that the Designation medians may be better for our purposes because there is more spread between the 5 types, and we know there are no missing values in the Designation column, whereas Trips had only a weak correlation with Age. Linking Age and Income would not be a good idea because a) it is circular: we did so above and were left with 130 missing values, and b) it would increase the correlation between the two variables (i.e. if we took the column median for both and filled the NaNs).
TD['Age'] = TD['Age'].fillna(TD.groupby('Designation')['Age'].transform('median'))
TD['Age'].isnull().sum() #checking to see the nans of age
0
We can now go back and replace the nans in Income using the median of Age...
TD['Income'] = TD['Income'].fillna(TD.groupby('Age')['Income'].transform('median'))
TD['Income'].isnull().sum() #checking to see the nans of income
0
Duration
The Duration variable had 226 missing values. We know from the EDA analysis that Duration was not correlated with any numerical variable, and we could not find any strong links between Duration and any categorical variable. It is perhaps best, then, to simply replace the missing Duration values with either the median or the mode. We know from the EDA analysis that the data is skewed to the right, so the median is the more appropriate choice here.
np.mean(TD.Duration)
15.445738942826322
np.median(TD.Duration) # np.median returns NaN here, but pandas' .median() ignores NaNs when computing the fill value. Since there is positive skewness, the median (ignoring NaNs) should be slightly less than the mean of 15.45
nan
TD['Duration'] = TD['Duration'].fillna(TD['Duration'].median()) # replaces NaN values with the column's median
TD.loc[[1173], :] # check that this row's missing Duration has been filled with the column median
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1173 | 0 | 38.0 | Company Invited | 1 | 13.0 | Salaried | Male | 2 | 3.0 | 3 | 3.0 | Married | 1.0 | 0 | 5 | 1 | 0.0 | 1 | 19735.0 |
Marital Status
The last variable we need to look at is Marital Status. We know there are four distinct categories, but there may be redundancies here. Are Unmarried and Single synonymous? Just how different is Divorced from Single or Unmarried? Let's explore.
TD["MaritalStatus"].value_counts()
Married 2325 Divorced 944 Single 910 Unmarried 682 Name: MaritalStatus, dtype: int64
Married = TD.loc[TD['MaritalStatus'] == 'Married'].copy() # this will create a dataframe of all rows where Marital Status = Married
Divorced = TD.loc[TD['MaritalStatus'] == 'Divorced'].copy() # this will create a dataframe of all rows where Marital Status = Divorced
Single = TD.loc[TD['MaritalStatus'] == 'Single'].copy() # this will create a dataframe of all rows where Marital Status = Single
Unmarried = TD.loc[TD['MaritalStatus'] == 'Unmarried'].copy() # this will create a dataframe of all rows where Marital Status = Unmarried
Married.describe()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.000000 | 2325.00000 | 2325.000000 |
| mean | 0.140215 | 38.116129 | 0.380645 | 15.500645 | 2.934194 | 3.677419 | 2.317849 | 3.594409 | 3.270538 | 0.293333 | 3.044301 | 0.616344 | 1.212903 | 1.11957 | 23779.660430 |
| std | 0.347285 | 8.655797 | 0.896092 | 7.960565 | 0.724222 | 1.030111 | 1.176379 | 0.804604 | 1.791022 | 0.455388 | 1.439909 | 0.486380 | 0.855179 | 1.12795 | 5155.737515 |
| min | 0.000000 | 21.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.00000 | 16052.000000 |
| 25% | 0.000000 | 32.000000 | -1.000000 | 9.000000 | 2.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.00000 | 20635.500000 |
| 50% | 0.000000 | 37.000000 | 1.000000 | 13.000000 | 3.000000 | 4.000000 | 2.000000 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 1.00000 | 22426.000000 |
| 75% | 0.000000 | 44.000000 | 1.000000 | 20.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 2.00000 | 25772.000000 |
| max | 1.000000 | 61.000000 | 1.000000 | 36.000000 | 5.000000 | 6.000000 | 5.000000 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 4.00000 | 38677.000000 |
Married["ProdTaken"].value_counts()
0 1999 1 326 Name: ProdTaken, dtype: int64
Divorced.describe()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 | 944.000000 |
| mean | 0.130297 | 38.371822 | 0.324153 | 15.266949 | 2.935381 | 3.682203 | 2.353814 | 3.574153 | 3.194915 | 0.292373 | 3.181144 | 0.652542 | 1.159958 | 1.152542 | 23929.112818 |
| std | 0.336808 | 9.258425 | 0.923827 | 8.292552 | 0.721623 | 0.936285 | 1.202607 | 0.792869 | 1.756109 | 0.455094 | 1.151339 | 0.476415 | 0.863325 | 1.160518 | 5274.790816 |
| min | 0.000000 | 22.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 16052.000000 |
| 25% | 0.000000 | 32.000000 | -1.000000 | 9.000000 | 2.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 20685.250000 |
| 50% | 0.000000 | 36.000000 | 1.000000 | 13.000000 | 3.000000 | 4.000000 | 2.500000 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 22484.000000 |
| 75% | 0.000000 | 45.000000 | 1.000000 | 20.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 2.000000 | 25985.500000 |
| max | 1.000000 | 61.000000 | 1.000000 | 36.000000 | 4.000000 | 6.000000 | 5.000000 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 4.000000 | 38677.000000 |
Divorced["ProdTaken"].value_counts()
0 821 1 123 Name: ProdTaken, dtype: int64
Single.describe()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 | 910.000000 |
| mean | 0.329670 | 35.568132 | 0.401099 | 14.908791 | 2.672527 | 3.610989 | 2.039560 | 3.553846 | 2.869231 | 0.305495 | 3.103297 | 0.596703 | 1.001099 | 0.851648 | 22262.318132 |
| std | 0.470352 | 10.516438 | 0.901410 | 7.815208 | 0.700788 | 0.958024 | 1.293368 | 0.775634 | 1.683231 | 0.460870 | 1.342497 | 0.490829 | 0.826539 | 1.210048 | 5925.540136 |
| min | 0.000000 | 18.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 16009.000000 |
| 25% | 0.000000 | 29.000000 | -1.000000 | 9.000000 | 2.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 17926.000000 |
| 50% | 0.000000 | 34.000000 | 1.000000 | 13.000000 | 3.000000 | 4.000000 | 1.000000 | 3.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 0.000000 | 20546.500000 |
| 75% | 1.000000 | 42.000000 | 1.000000 | 17.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 1.000000 | 22089.500000 |
| max | 1.000000 | 60.000000 | 1.000000 | 36.000000 | 5.000000 | 6.000000 | 5.000000 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 4.000000 | 38604.000000 |
Single["ProdTaken"].value_counts()
0 610 1 300 Name: ProdTaken, dtype: int64
Unmarried.describe()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.00000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 |
| mean | 0.243402 | 36.539589 | 0.180352 | 15.412023 | 3.093842 | 4.013196 | 2.222874 | 3.565982 | 3.31085 | 0.269795 | 3.027859 | 0.624633 | 1.391496 | 0.998534 | 23805.494135 |
| std | 0.429451 | 8.218691 | 0.976085 | 8.143286 | 0.685484 | 0.957911 | 0.853535 | 0.805206 | 1.73675 | 0.444179 | 1.411340 | 0.484573 | 0.807882 | 0.744040 | 1893.537491 |
| min | 0.000000 | 20.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 1.00000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 21278.000000 |
| 25% | 0.000000 | 31.000000 | -1.000000 | 9.000000 | 3.000000 | 4.000000 | 1.000000 | 3.000000 | 2.00000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 22183.000000 |
| 50% | 0.000000 | 36.000000 | 1.000000 | 13.000000 | 3.000000 | 4.000000 | 2.000000 | 3.000000 | 3.00000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 23406.000000 |
| 75% | 0.000000 | 41.000000 | 1.000000 | 19.000000 | 4.000000 | 5.000000 | 3.000000 | 4.000000 | 4.00000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 2.000000 | 25044.750000 |
| max | 1.000000 | 61.000000 | 1.000000 | 36.000000 | 5.000000 | 6.000000 | 4.000000 | 5.000000 | 8.00000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 3.000000 | 28566.000000 |
Unmarried["ProdTaken"].value_counts()
0 516 1 166 Name: ProdTaken, dtype: int64
No = pd.Series([1999,821,610,516],
['Married', 'Divorced', 'Single', 'Unmarried'])
Yes = pd.Series([326,123,300,166],
['Married', 'Divorced', 'Single', 'Unmarried'])
Tots = pd.Series([round(326/(1999+326)*100, 2), round(123/(821+123)*100, 2), round(300/(610+300)*100, 2), round(166/(516+166)*100, 2)],
['Married', 'Divorced', 'Single', 'Unmarried'])
CT = pd.concat([No,Yes,Tots,],axis=1,sort=False)
CT.columns=['No','Yes','% Yes']
print('\033[1m' + 'Breakdown by MaritalStatus Type')
CT
Breakdown by MaritalStatus Type
| No | Yes | % Yes | |
|---|---|---|---|
| Married | 1999 | 326 | 14.02 |
| Divorced | 821 | 123 | 13.03 |
| Single | 610 | 300 | 32.97 |
| Unmarried | 516 | 166 | 24.34 |
Surprisingly, the pair of categories with the closest resemblance (in mean Age, Duration, Income, Trips, and the percentage who said yes to the product being sold) is Married and Divorced. We had hoped that Unmarried and Single would be similar enough to combine, but this is not the case. It seems more prudent to leave all 4 categories in place for now.
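As a side note, the yes-rate table above was assembled from hand-typed counts; `pd.crosstab` with `normalize='index'` computes the same per-category rates directly and is less error-prone. A minimal sketch on toy data (not the notebook's TD frame):

```python
import pandas as pd

# Toy frame: two marital-status categories and a binary purchase flag.
df = pd.DataFrame({'MaritalStatus': ['Married', 'Married', 'Married', 'Single', 'Single'],
                   'ProdTaken':     [0, 0, 1, 1, 0]})
# normalize='index' turns each row of counts into proportions,
# so column 1 holds the share of "yes" within each category.
rates = pd.crosstab(df['MaritalStatus'], df['ProdTaken'], normalize='index')
print(rates.loc['Single', 1])  # 0.5
```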
for feature in TD.columns: # Loop through all columns in the dataframe
if TD[feature].dtype == 'object': # Only apply for columns with categorical strings
        TD[feature] = pd.Categorical(TD[feature]) # Convert the string column to a pandas Categorical
TD.head()
| ProdTaken | Age | Contact | City | Duration | Occupation | Gender | Visitors | Followups | Product | Stars | MaritalStatus | Trips | Passport | Score | Car | Children | Designation | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | -1 | 6.0 | Salaried | Female | 3 | 3.0 | 3 | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | 1 | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | 3 | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | 1 | 20130.0 |
| 2 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | 1 | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | 0 | 17909.0 |
| 3 | 0 | 32.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | 1 | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | 0 | 18468.0 |
| 4 | 0 | 32.0 | Company Invited | 1 | 8.0 | Salaried | Male | 3 | 3.0 | 1 | 3.0 | Single | 1.0 | 0 | 5 | 1 | 1.0 | 0 | 18068.0 |
TD["Contact"].value_counts()
Self Enquiry 3442 Company Invited 1419 Name: Contact, dtype: int64
replaceStruct = {
"Contact": {"Self Enquiry": 1, "Company Invited": 2},
"Occupation": {"Salaried": 1, "Small Business":2 , "Large Business": 3},
"Gender": {"Male": 1, "Female":2},
"MaritalStatus": {"Married": 1, "Divorced": 2 ,"Single": 3 ,"Unmarried": 4},
}
oneHotCols=["Contact", "Occupation", "Gender", "MaritalStatus" ]
TD=TD.replace(replaceStruct)
TD=pd.get_dummies(TD, columns=oneHotCols)
TD.head()
| ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | ... | Contact_2 | Occupation_1 | Occupation_2 | Occupation_3 | Gender_1 | Gender_2 | MaritalStatus_1 | MaritalStatus_2 | MaritalStatus_3 | MaritalStatus_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | -1 | 6.0 | 3 | 3.0 | 3 | 3.0 | 1.0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 49.0 | 1 | 14.0 | 3 | 4.0 | 3 | 4.0 | 2.0 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 33.0 | 1 | 9.0 | 2 | 3.0 | 1 | 3.0 | 2.0 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 32.0 | 1 | 8.0 | 2 | 3.0 | 1 | 4.0 | 1.0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 32.0 | 1 | 8.0 | 3 | 3.0 | 1 | 3.0 | 1.0 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 26 columns
TD.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4861 entries, 0 to 4860 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProdTaken 4861 non-null int64 1 Age 4861 non-null float64 2 City 4861 non-null int64 3 Duration 4861 non-null float64 4 Visitors 4861 non-null int64 5 Followups 4861 non-null float64 6 Product 4861 non-null int64 7 Stars 4861 non-null float64 8 Trips 4861 non-null float64 9 Passport 4861 non-null int64 10 Score 4861 non-null int64 11 Car 4861 non-null int64 12 Children 4861 non-null float64 13 Designation 4861 non-null int64 14 Income 4861 non-null float64 15 Contact_1 4861 non-null uint8 16 Contact_2 4861 non-null uint8 17 Occupation_1 4861 non-null uint8 18 Occupation_2 4861 non-null uint8 19 Occupation_3 4861 non-null uint8 20 Gender_1 4861 non-null uint8 21 Gender_2 4861 non-null uint8 22 MaritalStatus_1 4861 non-null uint8 23 MaritalStatus_2 4861 non-null uint8 24 MaritalStatus_3 4861 non-null uint8 25 MaritalStatus_4 4861 non-null uint8 dtypes: float64(7), int64(8), uint8(11) memory usage: 622.0 KB
To preserve the class ratio in the train and test sets, we will pass the target variable to the stratify parameter in the train_test_split function.
X = TD.drop("ProdTaken" , axis=1)
y = TD[["ProdTaken"]]
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
(3402, 25) (1459, 25)
y.value_counts(normalize=True) #check the overall class ratio in the full data
ProdTaken 0 0.811767 1 0.188233 dtype: float64
y_test.value_counts(normalize=True) #check to see if the ratio holds on test data... very close to numbers above
ProdTaken 0 0.811515 1 0.188485 dtype: float64
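The ratio check above can be illustrated on toy data: with stratify=y, train_test_split preserves the class proportions in each split (up to rounding). A minimal sketch, independent of the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with an 80/20 class split, mirroring the check above.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
print(y_tr.mean(), y_te.mean())  # both 0.2: the ratio holds in each split
```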
Before building the model, let's create functions to calculate different metrics- Accuracy, Recall and Precision and plot the confusion matrix.
## Importing the model classes and metric functions used below
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
'''
model : classifier to predict values of X
y_actual : ground truth
'''
y_predict = model.predict(X_test)
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,flag=True):
'''
model : classifier to predict values of X
'''
# defining an empty list to store train and test results
score_list=[]
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
train_acc = model.score(X_train,y_train)
test_acc = model.score(X_test,y_test)
train_recall = metrics.recall_score(y_train,pred_train)
test_recall = metrics.recall_score(y_test,pred_test)
train_precision = metrics.precision_score(y_train,pred_train)
test_precision = metrics.precision_score(y_test,pred_test)
train_f1 = f1_score(y_train,pred_train)
test_f1 = f1_score(y_test,pred_test)
score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision, train_f1, test_f1))
    # If the flag is set to True, the following print statements will be displayed. The default value is True.
if flag == True:
print("Accuracy on training set : ",model.score(X_train,y_train))
print("Accuracy on test set : ",model.score(X_test,y_test))
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
print("Precision on training set : ",metrics.precision_score(y_train,pred_train))
print("Precision on test set : ",metrics.precision_score(y_test,pred_test))
print("F1-Score on training set : ",metrics.f1_score(y_train,pred_train))
print("F1-Score on test set : ",metrics.f1_score(y_test,pred_test))
return score_list # returning the list with train and test scores
#Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)
#Using above defined function to get accuracy, recall and precision and F1 scores on train and test set
get_metrics_score(d_tree)
#Creating confusion matrix
make_confusion_matrix(d_tree,y_test)
Accuracy on training set : 1.0 Accuracy on test set : 0.8869088416723784 Recall on training set : 1.0 Recall on test set : 0.7054545454545454 Precision on training set : 1.0 Precision on test set : 0.697841726618705 F1-Score on training set : 1.0 F1-Score on test set : 0.7016274864376131
#Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.18,1:0.72},random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.18, 1: 0.72}, max_depth=5,
max_leaf_nodes=15, min_impurity_decrease=0.0001,
random_state=1)
#Using above defined function to get accuracy, recall and precision and F1 scores on train and test set
get_metrics_score(dtree_estimator)
#Creating confusion matrix
make_confusion_matrix(dtree_estimator,y_test)
Accuracy on training set : 0.7527924750146973 Accuracy on test set : 0.7374914324880055 Recall on training set : 0.7375 Recall on test set : 0.730909090909091 Precision on training set : 0.41222707423580784 Precision on test set : 0.3941176470588235 F1-Score on training set : 0.5288515406162465 F1-Score on test set : 0.5121019108280255
#base_estimator for bagging classifier is a decision tree by default
bagging_estimator=BaggingClassifier(random_state=1)
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(random_state=1)
#Using above defined function to get accuracy, recall and precision and F1 scores on train and test set
bagging_estimator_score=get_metrics_score(bagging_estimator)
#Creating confusion matrix
make_confusion_matrix(bagging_estimator,y_test)
Accuracy on training set : 0.9944150499706055 Accuracy on test set : 0.9006168608636053 Recall on training set : 0.9703125 Recall on test set : 0.5381818181818182 Precision on training set : 1.0 Precision on test set : 0.891566265060241 F1-Score on training set : 0.9849325931800159 F1-Score on test set : 0.671201814058957
#Train the random forest classifier
rf_estimator=RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_score=get_metrics_score(rf_estimator)
# make the confusion matrix
make_confusion_matrix(rf_estimator,y_test)
Accuracy on training set : 1.0 Accuracy on test set : 0.9191226867717615 Recall on training set : 1.0 Recall on test set : 0.6145454545454545 Precision on training set : 1.0 Precision on test set : 0.9337016574585635 F1-Score on training set : 1.0 F1-Score on test set : 0.7412280701754386
Some of the important hyperparameters available for the bagging classifier are:
n_estimators: The number of base estimators in the ensemble.
max_samples: The number of samples to draw from X to train each base estimator.
max_features: The number of features to draw from X to train each base estimator.
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_samples': [0.7,0.8,0.9,1],
'max_features': [0.7,0.8,0.9,1],
'n_estimators' : [10,20,30,40,50],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=40,
random_state=1)
Let's check different metrics for bagging classifier with best hyperparameters and build a confusion matrix.
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_tuned_score=get_metrics_score(bagging_estimator_tuned)
#make the confusion matrix
make_confusion_matrix(bagging_estimator_tuned,y_test)
Accuracy on training set : 1.0 Accuracy on test set : 0.9259766963673749 Recall on training set : 1.0 Recall on test set : 0.6545454545454545 Precision on training set : 1.0 Precision on test set : 0.9326424870466321 F1-Score on training set : 1.0 F1-Score on test set : 0.7692307692307693
Let's also try changing the base_estimator of the bagging classifier, which is a decision tree by default, to logistic regression.
bagging_lr=BaggingClassifier(base_estimator=LogisticRegression(random_state=1),random_state=1)
bagging_lr.fit(X_train,y_train)
BaggingClassifier(base_estimator=LogisticRegression(random_state=1),
random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_lr_score=get_metrics_score(bagging_lr)
Accuracy on training set : 0.8348030570252792 Accuracy on test set : 0.8396161754626457 Recall on training set : 0.15625 Recall on test set : 0.2 Precision on training set : 0.819672131147541 Precision on test set : 0.7971014492753623 F1-Score on training set : 0.2624671916010499 F1-Score on test set : 0.3197674418604652
make_confusion_matrix(bagging_lr,y_test)
We will try to improve the model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:
oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy, default=False.
Note: Many Decision Tree hyperparameters, such as max_depth and min_samples_split, are also available to tune a Random Forest.
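Before tuning, the oob_score parameter mentioned above deserves a quick illustration. A minimal sketch on synthetic data (make_classification stands in for our training set here) shows how out-of-bag samples give a free estimate of generalization accuracy without a held-out set:

```python
# Sketch: estimating generalization accuracy from out-of-bag samples.
# The data here is synthetic and only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.8, 0.2], random_state=1)

# oob_score=True scores each tree on the bootstrap samples it did NOT see
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=1)
rf_oob.fit(X_demo, y_demo)

print(round(rf_oob.oob_score_, 3))
```

Because each tree is trained on a bootstrap sample, roughly a third of the rows are "out of bag" for any given tree, and scoring on those rows approximates test accuracy.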
# Choose the type of classifier.
rf_estimator_tuned = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(max_features=0.5000000000000001,
max_samples=0.6000000000000001, min_samples_leaf=5,
n_estimators=200, random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_tuned_score=get_metrics_score(rf_estimator_tuned)
#make confusion matrix
make_confusion_matrix(rf_estimator_tuned,y_test)
Accuracy on training set : 0.9097589653145208
Accuracy on test set : 0.8677176148046607
Recall on training set : 0.5484375
Recall on test set : 0.3890909090909091
Precision on training set : 0.9512195121951219
Precision on test set : 0.8106060606060606
F1-Score on training set : 0.6957383548067395
F1-Score on test set : 0.5257985257985258
The model performance is not very good. This may be because the classes are imbalanced, with 81.2% of customers saying no and 18.8% saying yes.
We should make the model aware that the class of interest here is 'yes'.
We can do so by passing the parameter class_weight, which is available for random forest. This parameter is not available for the bagging classifier.
class_weight specifies the weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
We can choose class_weight={0: 0.188, 1: 0.812} so that the minority class 'yes' receives the larger weight, mirroring the original imbalance in our data.
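To see the effect in isolation, here is a minimal sketch on synthetic imbalanced data (make_classification stands in for our real training set). The weighted forest typically trades some precision for higher recall on the minority class, though the size of the effect depends on the data:

```python
# Sketch: comparing a plain random forest against one with class_weight
# on synthetic data with the same 81.2% / 18.8% imbalance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=2000, weights=[0.812, 0.188],
                                     random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, stratify=y_demo,
                                      random_state=1)

plain = RandomForestClassifier(random_state=1).fit(Xtr, ytr)
weighted = RandomForestClassifier(class_weight={0: 0.188, 1: 0.812},
                                  random_state=1).fit(Xtr, ytr)

# class_weight scales each sample's contribution to the split criterion,
# so errors on the minority class cost more during training
print('plain recall   :', recall_score(yte, plain.predict(Xte)))
print('weighted recall:', recall_score(yte, weighted.predict(Xte)))
```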
# Choose the type of classifier.
rf_estimator_weighted = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"class_weight": [{0: 0.188, 1: 0.812}], # setting the weights equal to the 81.2% No, 18.8% Yes ratio
"n_estimators": [100,150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_weighted, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator_weighted = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator_weighted.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.188, 1: 0.812},
max_features=0.4000000000000001,
max_samples=0.6000000000000001, min_samples_leaf=9,
n_estimators=150, random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_weighted_score=get_metrics_score(rf_estimator_weighted)
#make confusion matrix
make_confusion_matrix(rf_estimator_weighted,y_test)
Accuracy on training set : 0.89564961787184
Accuracy on test set : 0.8636052090472927
Recall on training set : 0.821875
Recall on test set : 0.6618181818181819
Precision on training set : 0.6857887874837028
Precision on test set : 0.6319444444444444
F1-Score on training set : 0.7476901208244493
F1-Score on test set : 0.6465364120781527
# defining list of models
models = [bagging_estimator,bagging_estimator_tuned,bagging_lr,rf_estimator,rf_estimator_tuned,
rf_estimator_weighted]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through all the models to get the accuracy, recall and precision scores
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_train.append(np.round(j[6],2))
f1_test.append(np.round(j[7],2))
comparison_frame = pd.DataFrame({'Model':['Bagging classifier with default parameters','Tuned Bagging Classifier',
'Bagging classifier with base_estimator=LR', 'Random Forest with default parameters',
'Tuned Random Forest Classifier','Random Forest with class_weights'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1-Score':f1_train, 'Test_F1-Score':f1_test})
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1-Score | Test_F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bagging classifier with default parameters | 0.99 | 0.90 | 0.97 | 0.54 | 1.00 | 0.89 | 0.98 | 0.67 |
| 1 | Tuned Bagging Classifier | 1.00 | 0.93 | 1.00 | 0.65 | 1.00 | 0.93 | 1.00 | 0.77 |
| 2 | Bagging classifier with base_estimator=LR | 0.83 | 0.84 | 0.16 | 0.20 | 0.82 | 0.80 | 0.26 | 0.32 |
| 3 | Random Forest with default parameters | 1.00 | 0.92 | 1.00 | 0.61 | 1.00 | 0.93 | 1.00 | 0.74 |
| 4 | Tuned Random Forest Classifier | 0.91 | 0.87 | 0.55 | 0.39 | 0.95 | 0.81 | 0.70 | 0.53 |
| 5 | Random Forest with class_weights | 0.90 | 0.86 | 0.82 | 0.66 | 0.69 | 0.63 | 0.75 | 0.65 |
#Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)
#Calculating different metrics
get_metrics_score(ab_classifier)
#Creating confusion matrix
make_confusion_matrix(ab_classifier,y_test)
Accuracy on training set : 0.8459729570840682
Accuracy on test set : 0.8485263879369431
Recall on training set : 0.3421875
Recall on test set : 0.36363636363636365
Precision on training set : 0.6801242236024845
Precision on test set : 0.684931506849315
F1-Score on training set : 0.4553014553014553
F1-Score on test set : 0.4750593824228029
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=1.5000000000000002, n_estimators=100,
random_state=1)
#Calculating different metrics
get_metrics_score(abc_tuned)
#Creating confusion matrix
make_confusion_matrix(abc_tuned,y_test)
Accuracy on training set : 0.9873603762492651
Accuracy on test set : 0.8800548320767649
Recall on training set : 0.9546875
Recall on test set : 0.6218181818181818
Precision on training set : 0.9776
Precision on test set : 0.7066115702479339
F1-Score on training set : 0.9660079051383399
F1-Score on test set : 0.6615087040618955
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
#Calculating different metrics
get_metrics_score(gb_classifier)
#Creating confusion matrix
make_confusion_matrix(gb_classifier,y_test)
Accuracy on training set : 0.8871252204585538
Accuracy on test set : 0.8725154215215901
Recall on training set : 0.4796875
Recall on test set : 0.41454545454545455
Precision on training set : 0.8575418994413407
Precision on test set : 0.8201438848920863
F1-Score on training set : 0.6152304609218437
F1-Score on test set : 0.5507246376811594
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9, n_estimators=250, random_state=1,
subsample=0.9)
#Calculating different metrics
get_metrics_score(gbc_tuned)
#Creating confusion matrix
make_confusion_matrix(gbc_tuned,y_test)
Accuracy on training set : 0.9265138154027043
Accuracy on test set : 0.8917066483893078
Recall on training set : 0.65
Recall on test set : 0.5236363636363637
Precision on training set : 0.9411764705882353
Precision on test set : 0.8421052631578947
F1-Score on training set : 0.7689463955637708
F1-Score on test set : 0.6457399103139013
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss', enable_categorical=True)
xgb_classifier.fit(X_train,y_train)
#Calculating different metrics
get_metrics_score(xgb_classifier)
#Creating confusion matrix
make_confusion_matrix(xgb_classifier,y_test)
[09:41:39] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:573:
Parameters: { "enable_categorical" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
Accuracy on training set : 1.0
Accuracy on test set : 0.9266620973269363
Recall on training set : 1.0
Recall on test set : 0.6981818181818182
Precision on training set : 1.0
Precision on test set : 0.8888888888888888
F1-Score on training set : 1.0
F1-Score on test set : 0.7820773930753564
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')
# Grid of parameters to choose from
parameters = {
"n_estimators": [200],
"scale_pos_weight":[3],
"subsample":[0.9],
"learning_rate":[0.01],
"gamma":[0],
"colsample_bytree":[0.9],
"colsample_bylevel":[1],
"colsample_bynode":[1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.9, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=3, subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None)
Note
Tuned XGBoost showed the most potential of the models, so its hyperparameters were experimented with at length. The code above keeps only the final values, because the full search takes a long time to run. The following ranges were tried in various combinations:
"n_estimators": [10,30,50, 75, 100, 150, 200],
"scale_pos_weight":[1,2,3,4,5,6,7,8],
"subsample":[0.7,0.9,1],
"learning_rate":[0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0],
"gamma":[0,1,3],
"colsample_bytree":[0.5, 0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,0.9, 1],
"colsample_bynode":[0.5,0.7,0.9, 1]
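An exhaustive grid over all of these ranges runs to tens of thousands of combinations. A cheaper alternative is RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all. The sketch below illustrates the pattern on synthetic data with sklearn's GradientBoostingClassifier standing in (so the example runs without xgboost); the same call works with XGBClassifier and the grid listed above:

```python
# Sketch: sampling a large hyperparameter space with RandomizedSearchCV
# instead of an exhaustive grid search. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=400, weights=[0.8, 0.2],
                                     random_state=1)

param_dist = {
    "n_estimators": [10, 30, 50, 75, 100],
    "subsample": [0.7, 0.9, 1.0],
    "learning_rate": [0.01, 0.05, 0.1, 0.5],
}

# n_iter=10 evaluates only 10 of the 60 combinations, scored by F1 as above
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                            param_dist, n_iter=10, scoring="f1",
                            cv=3, random_state=1)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Randomized search is a useful first pass to narrow the ranges before a focused GridSearchCV like the one above.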
#Calculating different metrics
get_metrics_score(xgb_tuned)
#Creating confusion matrix
make_confusion_matrix(xgb_tuned,y_test)
Accuracy on training set : 0.921222810111699
Accuracy on test set : 0.8718300205620287
Recall on training set : 0.8296875
Recall on test set : 0.6945454545454546
Precision on training set : 0.7695652173913043
Precision on test set : 0.6496598639455783
F1-Score on training set : 0.7984962406015037
F1-Score on test set : 0.6713532513181021
# defining list of models
models = [ab_classifier, abc_tuned, gb_classifier, gbc_tuned, xgb_classifier,xgb_tuned]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_train.append(np.round(j[6],2))
f1_test.append(np.round(j[7],2))
comparison_frame = pd.DataFrame({'Model':['AdaBoost Classifier','Tuned AdaBoost Classifier',
'Gradient Boosting Classifier', 'Tuned Gradient Boosting Classifier',
'XGBoost Classifier', 'Tuned XGBoost Classifier'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1-Score':f1_train, 'Test_F1-Score':f1_test})
#Sorting models in decreasing order of test F1-score
comparison_frame.sort_values(by='Test_F1-Score',ascending=False)
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1-Score | Test_F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| 4 | XGBoost Classifier | 1.00 | 0.93 | 1.00 | 0.70 | 1.00 | 0.89 | 1.00 | 0.78 |
| 5 | Tuned XGBoost Classifier | 0.92 | 0.87 | 0.83 | 0.69 | 0.77 | 0.65 | 0.80 | 0.67 |
| 1 | Tuned AdaBoost Classifier | 0.99 | 0.88 | 0.95 | 0.62 | 0.98 | 0.71 | 0.97 | 0.66 |
| 3 | Tuned Gradient Boosting Classifier | 0.93 | 0.89 | 0.65 | 0.52 | 0.94 | 0.84 | 0.77 | 0.65 |
| 2 | Gradient Boosting Classifier | 0.89 | 0.87 | 0.48 | 0.41 | 0.86 | 0.82 | 0.62 | 0.55 |
| 0 | AdaBoost Classifier | 0.85 | 0.85 | 0.34 | 0.36 | 0.68 | 0.68 | 0.46 | 0.48 |
estimators = [('Random Forest',rf_estimator_weighted), ('Gradient Boosting',gbc_tuned), ('Decision Tree',dtree_estimator)]
final_estimator = xgb_tuned
stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)
stacking_classifier.fit(X_train,y_train)
StackingClassifier(estimators=[('Random Forest',
RandomForestClassifier(class_weight={0: 0.188,
1: 0.812},
max_features=0.4000000000000001,
max_samples=0.6000000000000001,
min_samples_leaf=9,
n_estimators=150,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9,
n_estimators=250,
random_state...
eval_metric='logloss', gamma=0,
gpu_id=-1,
importance_type='gain',
interaction_constraints='',
learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=200, n_jobs=4,
num_parallel_tree=1,
random_state=1, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=3,
subsample=0.9,
tree_method='exact',
validate_parameters=1,
verbosity=None))
#Calculating different metrics
get_metrics_score(stacking_classifier)
#Creating confusion matrix
make_confusion_matrix(stacking_classifier,y_test)
Accuracy on training set : 0.9156378600823045
Accuracy on test set : 0.8588074023303632
Recall on training set : 0.909375
Recall on test set : 0.76
Precision on training set : 0.717632552404439
Precision on test set : 0.5988538681948424
F1-Score on training set : 0.8022053756030325
F1-Score on test set : 0.6698717948717948
# defining list of models
bestmodels = [rf_estimator_weighted, xgb_tuned,stacking_classifier]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in bestmodels:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_train.append(np.round(j[6],2))
f1_test.append(np.round(j[7],2))
comparison_frame = pd.DataFrame({'Model':['Random Forest with class_weights','Tuned XGBoost Classifier', 'Stacking Classifier'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1-Score':f1_train, 'Test_F1-Score':f1_test})
#Sorting models in decreasing order of test F1-score
comparison_frame.sort_values(by='Test_F1-Score',ascending=False)
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1-Score | Test_F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Tuned XGBoost Classifier | 0.92 | 0.87 | 0.83 | 0.69 | 0.77 | 0.65 | 0.80 | 0.67 |
| 2 | Stacking Classifier | 0.92 | 0.86 | 0.91 | 0.76 | 0.72 | 0.60 | 0.80 | 0.67 |
| 0 | Random Forest with class_weights | 0.90 | 0.86 | 0.82 | 0.66 | 0.69 | 0.63 | 0.75 | 0.65 |
XGBoost
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
We have been able to build a predictive model: a) Visit With Us can use the model to predict which customers are likely to say yes to a travel product; b) Visit With Us can understand the key motivations prompting customers to purchase travel products; c) Visit With Us can develop a targeted marketing strategy to reduce marketing costs by avoiding investing time and resources in customers who are unlikely to make a final purchase; d) Visit With Us can also feel confident that focusing its marketing efforts will not miss customers who are likely to say yes, so opportunity costs are minimized with this model.
Factors that drive customer purchasing decisions - Passport, the product pitched by the sales rep, and the development level of the city in which the customer lives.
Passport: A customer with a passport is much more likely to purchase a travel product than a customer without one. Our model shows that not only is it the most important feature, but 34% of customers with passports make a purchase while only 12% of customers without one do. We also know that customers with passports are relatively infrequent, which drives the average purchase success rate down to 18.8%.
Product: The product offered to the customer is also very important to the end result. Our model shows it is the second most important feature. EDA shows that almost 30% of customers pitched the Basic package say yes. Standard has the second-highest success rate, at 17%, followed by Deluxe, King and Super Deluxe at 11%, 9% and 6%, respectively. The model can help sales reps decide which product to pitch before the call even begins. The products can also be ordered by success rate so that, if the customer declines the first option, they are more likely to accept the second. Without such a strategy, sales fatigue might set in and the customer may stop listening to the pitch or end the call prematurely.
City: The model also shows that a customer's likelihood of purchasing a travel product depends on the development status of the city in which they live. Residents of highly developed cities appear less likely to purchase travel products than their counterparts in tier 2 and tier 3 cities: only 16% of people from tier 1 cities purchased a product, compared with 23% of tier 2 residents and 24% of tier 3 residents. It may be that highly developed urban centres have far more attractions, so residents are more likely to explore their own city than book travel packages to other destinations. This is a research question sales reps could help answer with short exit-call surveys.
Children: The total number of children under 5 accompanying the customer on the trip seems to have little bearing on the decision. Customers, whether they have children or not, appear to select packages in similar proportions. This theory could be tested for statistical significance; however, according to the model, children are not a good predictor.
Car: Whether the customer owns a car is also a weak predictor. This may make sense because many trips require alternative modes of transportation such as airplanes and trains. Car rentals are also easy and inexpensive, which may reduce the importance of this feature even further.
Visitors: The number of visitors accompanying the customer is also of little importance. This could be because the customer would purchase the trip regardless of whether their friends or family intended to come along. It may also be that customers are purchasing only for themselves or their immediate family, and we are missing a key link the data cannot show us. For example, many customers who appear to be travelling alone (Visitors = 0) may in fact be travelling with friends who simply booked separately and therefore also show up as alone (Visitors = 0). This could be explored further, and incentives could be offered to increase the number of visitors per trip.
The model has many business implications, from targeting who to call for a sales pitch to the structure of the pitch itself. The order in which products are presented to the customer matters, and further research could find links between demographics and the success rates of each product. For example, VPs purchase the most King packages and should be presented with that option first. We can also see a steep drop-off in success rate after 15 minutes of call duration; sales reps should know that if the sale is not complete by then, it might be wise to cut their losses. Sunk costs are a cost of doing business, but they can be managed and reduced.
Basic
Basic = TD[(TD['Product'] == 1)] #Creates a dataframe of just the basic packages
Basic.describe()
| | ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1842.000000 | 1698.000000 | 1842.000000 | 1781.000000 | 1842.000000 | 1829.000000 | 1842.0 | 1841.000000 | 1837.000000 | 1842.000000 | 1842.000000 | 1842.000000 | 1841.000000 | 1755.000000 |
| mean | 0.299674 | 33.054181 | 0.650380 | 15.194834 | 2.929967 | 3.693822 | 1.0 | 3.585551 | 3.096353 | 0.308360 | 3.081433 | 0.588491 | 1.199348 | 19939.922507 |
| std | 0.458240 | 8.613176 | 0.720177 | 8.645041 | 0.719091 | 1.020721 | 0.0 | 0.792895 | 1.790490 | 0.461942 | 1.353729 | 0.492241 | 0.858411 | 3600.070882 |
| min | 0.000000 | 18.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 1.0 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 16009.000000 |
| 25% | 0.000000 | 27.000000 | 1.000000 | 9.000000 | 2.000000 | 3.000000 | 1.0 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 17654.000000 |
| 50% | 0.000000 | 32.000000 | 1.000000 | 13.000000 | 3.000000 | 4.000000 | 1.0 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 20689.000000 |
| 75% | 1.000000 | 37.000000 | 1.000000 | 19.000000 | 3.000000 | 4.000000 | 1.0 | 4.000000 | 4.000000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 21412.500000 |
| max | 1.000000 | 60.000000 | 1.000000 | 127.000000 | 5.000000 | 6.000000 | 1.0 | 5.000000 | 20.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 98678.000000 |
#Product vs Occupation
sns.countplot(data = Basic, x = 'Product', hue = 'Occupation')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Gender
sns.countplot(data = Basic, x = 'Product', hue = 'Gender')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Contact
sns.countplot(data = Basic, x = 'Product', hue = 'Contact')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Marital Status
sns.countplot(data = Basic, x = 'Product', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Designation
sns.countplot(data = Basic, x = 'Product', hue = 'Designation')
<AxesSubplot:xlabel='Product', ylabel='count'>
Basic Product: Customer Profile Summary
Standard
Standard = TD[(TD['Product'] == 2)] #Creates a dataframe of just the Standard packages
Standard.describe()
| | ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 742.000000 | 741.000000 | 742.000000 | 741.000000 | 742.000000 | 742.000000 | 742.0 | 741.000000 | 741.000000 | 742.000000 | 742.000000 | 742.000000 | 741.000000 | 742.000000 |
| mean | 0.167116 | 40.581646 | 0.199461 | 15.986505 | 2.920485 | 3.791105 | 2.0 | 3.647773 | 3.317139 | 0.283019 | 3.061995 | 0.652291 | 1.178138 | 26637.749326 |
| std | 0.373331 | 8.373084 | 0.968101 | 8.298638 | 0.733163 | 0.919588 | 0.0 | 0.837523 | 1.799577 | 0.450769 | 1.383900 | 0.476564 | 0.868741 | 2797.180737 |
| min | 0.000000 | 19.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 2.0 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 17372.000000 |
| 25% | 0.000000 | 35.000000 | -1.000000 | 9.000000 | 2.000000 | 3.000000 | 2.0 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 24860.000000 |
| 50% | 0.000000 | 38.000000 | 1.000000 | 14.000000 | 3.000000 | 4.000000 | 2.0 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 26425.000000 |
| 75% | 0.000000 | 46.000000 | 1.000000 | 22.000000 | 3.000000 | 4.000000 | 2.0 | 4.000000 | 5.000000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 28716.000000 |
| max | 1.000000 | 61.000000 | 1.000000 | 36.000000 | 5.000000 | 6.000000 | 2.0 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 38395.000000 |
#Product vs Occupation
sns.countplot(data = Standard, x = 'Product', hue = 'Occupation')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Gender
sns.countplot(data = Standard, x = 'Product', hue = 'Gender')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Contact
sns.countplot(data = Standard, x = 'Product', hue = 'Contact')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Marital Status
sns.countplot(data = Standard, x = 'Product', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Designation
sns.countplot(data = Standard, x = 'Product', hue = 'Designation')
<AxesSubplot:xlabel='Product', ylabel='count'>
Standard Product: Customer Profile Summary
Deluxe
Deluxe = TD[(TD['Product'] == 3)] #Creates a dataframe of just the Deluxe packages
Deluxe.describe()
| | ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1732.000000 | 1651.000000 | 1732.000000 | 1543.000000 | 1732.000000 | 1700.000000 | 1732.0 | 1731.000000 | 1730.000000 | 1732.000000 | 1732.000000 | 1732.000000 | 1731.000000 | 1586.000000 |
| mean | 0.117783 | 37.382193 | 0.071594 | 15.935839 | 2.891455 | 3.701765 | 3.0 | 3.560370 | 3.282081 | 0.277136 | 3.006928 | 0.610855 | 1.177354 | 22643.157629 |
| std | 0.322444 | 7.800693 | 0.986074 | 8.474829 | 0.728076 | 1.018312 | 0.0 | 0.791735 | 1.907694 | 0.447714 | 1.363873 | 0.487697 | 0.852928 | 2513.488157 |
| min | 0.000000 | 21.000000 | -1.000000 | 5.000000 | 1.000000 | 1.000000 | 3.0 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1000.000000 |
| 25% | 0.000000 | 32.000000 | -1.000000 | 9.000000 | 2.000000 | 3.000000 | 3.0 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 20737.750000 |
| 50% | 0.000000 | 36.000000 | 1.000000 | 14.000000 | 3.000000 | 4.000000 | 3.0 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 22922.000000 |
| 75% | 0.000000 | 42.000000 | 1.000000 | 21.000000 | 3.000000 | 4.000000 | 3.0 | 4.000000 | 4.000000 | 1.000000 | 4.000000 | 1.000000 | 2.000000 | 24199.250000 |
| max | 1.000000 | 61.000000 | 1.000000 | 36.000000 | 5.000000 | 6.000000 | 3.0 | 5.000000 | 22.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 38525.000000 |
#Product vs Occupation
sns.countplot(data = Deluxe, x = 'Product', hue = 'Occupation')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Gender
sns.countplot(data = Deluxe, x = 'Product', hue = 'Gender')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Contact
sns.countplot(data = Deluxe, x = 'Product', hue = 'Contact')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Marital Status
sns.countplot(data = Deluxe, x = 'Product', hue = 'MaritalStatus')
<AxesSubplot:xlabel='Product', ylabel='count'>
#Product vs Designation
sns.countplot(data = Deluxe, x = 'Product', hue = 'Designation')
<AxesSubplot:xlabel='Product', ylabel='count'>
Deluxe Product: Customer Profile Summary
Super Deluxe
SuperDeluxe = TD[(TD['Product'] == 4)] #Creates a dataframe of just the Super Deluxe packages
SuperDeluxe.describe()
| | ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 342.000000 | 342.000000 | 342.000000 | 342.000000 | 342.000000 | 342.000000 | 342.0 | 342.000000 | 292.000000 | 342.000000 | 342.000000 | 342.000000 | 300.000000 | 342.000000 |
| mean | 0.058480 | 48.026316 | 0.269006 | 16.230994 | 2.845029 | 3.576023 | 4.0 | 3.578947 | 3.589041 | 0.304094 | 3.356725 | 0.701754 | 1.073333 | 32241.046784 |
| std | 0.234992 | 6.324269 | 0.949226 | 9.484167 | 0.740109 | 1.029653 | 0.0 | 0.802066 | 1.907378 | 0.460696 | 1.324791 | 0.458158 | 0.858533 | 2859.687433 |
| min | 0.000000 | 29.000000 | -1.000000 | 6.000000 | 1.000000 | 1.000000 | 4.0 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 17705.000000 |
| 25% | 0.000000 | 43.000000 | -1.000000 | 9.000000 | 2.000000 | 3.000000 | 4.0 | 3.000000 | 2.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 30847.000000 |
| 50% | 0.000000 | 49.000000 | 1.000000 | 13.000000 | 3.000000 | 4.000000 | 4.0 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 32181.000000 |
| 75% | 0.000000 | 53.000000 | 1.000000 | 23.000000 | 3.000000 | 4.000000 | 4.0 | 4.000000 | 5.000000 | 1.000000 | 5.000000 | 1.000000 | 2.000000 | 34787.000000 |
| max | 1.000000 | 60.000000 | 1.000000 | 36.000000 | 4.000000 | 6.000000 | 4.0 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 37502.000000 |
#Product vs Occupation
sns.countplot(data = SuperDeluxe, x = 'Product', hue = 'Occupation')
#Product vs Gender
sns.countplot(data = SuperDeluxe, x = 'Product', hue = 'Gender')
#Product vs Contact
sns.countplot(data = SuperDeluxe, x = 'Product', hue = 'Contact')
#Product vs Marital Status
sns.countplot(data = SuperDeluxe, x = 'Product', hue = 'MaritalStatus')
#Product vs Designation
sns.countplot(data = SuperDeluxe, x = 'Product', hue = 'Designation')
Super Deluxe Product: Customer Profile Summary
King
King = TD[TD['Product'] == 5]  # Dataframe of just the King packages
King.describe()
|  | ProdTaken | Age | City | Duration | Visitors | Followups | Product | Stars | Trips | Passport | Score | Car | Children | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 230.000000 | 230.000000 | 230.000000 | 230.000000 | 230.000000 | 230.000000 | 230.0 | 207.000000 | 148.000000 | 230.000000 | 230.000000 | 230.000000 | 209.000000 | 230.000000 |
| mean | 0.086957 | 48.065217 | 0.556522 | 12.100000 | 2.847826 | 3.804348 | 5.0 | 3.478261 | 3.344595 | 0.260870 | 3.226087 | 0.721739 | 1.358852 | 35878.947826 |
| std | 0.282386 | 6.133173 | 0.772805 | 5.764179 | 0.692044 | 0.930313 | 0.0 | 0.729584 | 1.868874 | 0.440067 | 1.423820 | 0.449120 | 0.832299 | 2461.946914 |
| min | 0.000000 | 27.000000 | -1.000000 | 5.000000 | 2.000000 | 1.000000 | 5.0 | 3.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 17517.000000 |
| 25% | 0.000000 | 42.250000 | 0.000000 | 8.000000 | 2.000000 | 3.000000 | 5.0 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 | 34202.000000 |
| 50% | 0.000000 | 49.000000 | 1.000000 | 11.000000 | 3.000000 | 4.000000 | 5.0 | 3.000000 | 3.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 34999.000000 |
| 75% | 0.000000 | 52.750000 | 1.000000 | 14.000000 | 3.000000 | 4.000000 | 5.0 | 4.000000 | 5.000000 | 1.000000 | 4.750000 | 1.000000 | 2.000000 | 37880.000000 |
| max | 1.000000 | 61.000000 | 1.000000 | 33.000000 | 4.000000 | 6.000000 | 5.0 | 5.000000 | 8.000000 | 1.000000 | 5.000000 | 1.000000 | 3.000000 | 38677.000000 |
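The `count` row shows missing values in the King subset: only 148 of 230 rows have `Trips`, 207 have `Stars`, and 209 have `Children`. The `SimpleImputer` imported at the top of the notebook can fill these before modeling; a minimal sketch on a toy frame (column names match `TD`, values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps like those in the King subset (values are illustrative).
toy = pd.DataFrame({
    'Trips':    [3.0, np.nan, 5.0, np.nan],
    'Stars':    [3.0, 4.0, np.nan, 3.0],
    'Children': [1.0, 2.0, np.nan, 1.0],
})

# Median imputation is less sensitive than the mean to right-skewed counts.
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled.isna().sum())
```

Fitting the imputer on the training split only (after `train_test_split`) avoids leaking test-set statistics into the model.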
#Product vs Occupation
sns.countplot(data = King, x = 'Product', hue = 'Occupation')
#Product vs Gender
sns.countplot(data = King, x = 'Product', hue = 'Gender')
#Product vs Contact
sns.countplot(data = King, x = 'Product', hue = 'Contact')
#Product vs Marital Status
sns.countplot(data = King, x = 'Product', hue = 'MaritalStatus')
#Product vs Designation
sns.countplot(data = King, x = 'Product', hue = 'Designation')
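Each package repeats the same five countplot cells, and because each filtered frame holds a single `Product` value, the x axis carries no information: every plot is really just the distribution of its hue column. A loop over the categorical columns condenses the pattern; a sketch on a toy stand-in for the King subset (real column names, made-up values):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the King subset (column names from TD, illustrative values).
king = pd.DataFrame({
    'Product':       [5] * 6,
    'Occupation':    ['Salaried', 'Small Business', 'Salaried',
                      'Large Business', 'Salaried', 'Small Business'],
    'Gender':        ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Contact':       ['Self Enquiry', 'Company Invited'] * 3,
    'MaritalStatus': ['Married', 'Single', 'Married',
                      'Divorced', 'Married', 'Single'],
    'Designation':   ['Executive', 'Manager', 'VP',
                      'Executive', 'AVP', 'Manager'],
})

hues = ['Occupation', 'Gender', 'Contact', 'MaritalStatus', 'Designation']
fig, axes = plt.subplots(1, len(hues), figsize=(20, 4))
for ax, col in zip(axes, hues):
    # x has a single category, so each panel reduces to the hue distribution.
    sns.countplot(data=king, x='Product', hue=col, ax=ax)
```

The same loop works for the Basic, Standard, Deluxe, and Super Deluxe frames, replacing five cells per package with one.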
King Product: Customer Profile Summary